Imaginethat committed on
Commit
8a11f7f
·
verified ·
1 Parent(s): 65cddc5

Upload 68 files

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitattributes +2 -0
  2. README.md +92 -33
  3. assets/avocado.ico +0 -0
  4. assets/case_1.mp4 +3 -0
  5. assets/case_2.png +3 -0
  6. environment.yml +327 -0
  7. eval_scripts/DREAM-1K/dream_example.jsonl +0 -0
  8. eval_scripts/DREAM-1K/eval_DREAM-1K.sh +12 -0
  9. eval_scripts/DREAM-1K/generate_caption.py +91 -0
  10. eval_scripts/DREAM-1K/tarsier/LICENSE +201 -0
  11. eval_scripts/DREAM-1K/tarsier/configs/tarser2_default_config.yaml +14 -0
  12. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/multi_images_parser.py +199 -0
  13. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/object_tracking_parser.py +160 -0
  14. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/standard_vision_parser.py +255 -0
  15. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils.py +452 -0
  16. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils_visualize.py +54 -0
  17. eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/video_permutation_parser.py +137 -0
  18. eval_scripts/DREAM-1K/tarsier/dataset/tarsier_datamodule.py +280 -0
  19. eval_scripts/DREAM-1K/tarsier/dataset/tarsier_processor.py +240 -0
  20. eval_scripts/DREAM-1K/tarsier/dataset/utils.py +186 -0
  21. eval_scripts/DREAM-1K/tarsier/evaluation/evaluate.py +177 -0
  22. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/__init__.py +5 -0
  23. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_caption_cider.py +82 -0
  24. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_dream_gpt.py +436 -0
  25. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_mc.py +159 -0
  26. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_oe_gpt.py +153 -0
  27. eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_video_mme.py +358 -0
  28. eval_scripts/DREAM-1K/tarsier/models/modeling_qwen2_vl_fast.py +1320 -0
  29. eval_scripts/DREAM-1K/tarsier/models/modeling_tarsier.py +502 -0
  30. eval_scripts/DREAM-1K/tarsier/models/utils.py +17 -0
  31. eval_scripts/DREAM-1K/tarsier/scripts/run_demo_cli.sh +15 -0
  32. eval_scripts/DREAM-1K/tarsier/scripts/run_demo_gradio.sh +9 -0
  33. eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh +12 -0
  34. eval_scripts/DREAM-1K/tarsier/scripts/run_inference_benchmark.sh +80 -0
  35. eval_scripts/DREAM-1K/tarsier/scripts/run_inference_caption.sh +79 -0
  36. eval_scripts/DREAM-1K/tarsier/tasks/demo_cli.py +116 -0
  37. eval_scripts/DREAM-1K/tarsier/tasks/demo_gradio.py +230 -0
  38. eval_scripts/DREAM-1K/tarsier/tasks/inference_benchmark.py +197 -0
  39. eval_scripts/DREAM-1K/tarsier/tasks/inference_caption.py +165 -0
  40. eval_scripts/DREAM-1K/tarsier/tasks/inference_quick_start.py +91 -0
  41. eval_scripts/DREAM-1K/tarsier/tasks/utils.py +45 -0
  42. eval_scripts/DREAM-1K/tarsier/tools/color.py +36 -0
  43. eval_scripts/DREAM-1K/tarsier/tools/conversation.py +256 -0
  44. eval_scripts/DREAM-1K/tarsier/tools/ptbtokenizer.py +66 -0
  45. eval_scripts/DREAM-1K/tarsier/tools/rw_utils.py +64 -0
  46. eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh +62 -0
  47. eval_scripts/Daily-Omni/analysis.py +18 -0
  48. eval_scripts/Daily-Omni/evaluation.py +225 -0
  49. eval_scripts/Daily-Omni/generate_caption.py +142 -0
  50. eval_scripts/Daily-Omni/grouped_data.json +0 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/case_1.mp4 filter=lfs diff=lfs merge=lfs -text
37
+ assets/case_2.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,48 +1,107 @@
1
- ---
2
- title: Avocado On Toast
3
- emoji: 🥑
4
- colorFrom: green
5
- colorTo: yellow
6
- sdk: gradio
7
- app_file: app.py
8
- pinned: false
9
  ---
10
 
11
- # Avocado On Toast
 
12
 
13
- Gradio Space that standardizes uploaded videos with ffmpeg, runs AVoCaDO, and
14
- writes JSONL outputs into a run-labeled dataset folder.
 
15
 
16
- ## Usage (Hugging Face Space)
 
17
 
18
- 1. Create a new Gradio Space and connect it to this repo.
19
- 2. Ensure `ffmpeg` and AVoCaDO are available in the Space image.
20
- 3. Set `AVOCADO_CMD` as a Space secret or enter it in the UI textbox.
21
- - The command must accept `{input}` and `{output}` placeholders.
22
- - Example:
23
- ```bash
24
- python -m avocado_captioner.cli --input {input} --output {output}
25
- ```
26
- 4. Launch the Space, upload videos, and click **Process videos**.
27
 
28
- Outputs are written to:
 
 
 
29
 
 
 
 
 
 
 
30
  ```
31
- /data/dataset/runs/<run_label>/annotations.jsonl
 
 
 
32
  ```
33
 
34
- Each run produces `annotations.jsonl` and `manifest.json` under the run label.
 
 
 
 
 
 
 
35
 
36
- ## Configuration
 
 
 
37
 
38
- - `DATA_ROOT` (default: `/data`) controls the root folder for uploads and output.
39
- - `AVOCADO_CMD` defines the AVoCaDO command used during processing.
 
 
 
40
 
41
- ## Notes
 
 
 
42
 
43
- The standardization step uses:
44
- - 720p scaling (`scale=-2:720`)
45
- - H.264 video (`libx264`, `crf=23`)
46
- - AAC audio (`128k`)
 
 
 
 
47
 
48
- Adjust these settings in `app.py` if you want a different quality/compute tradeoff.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # <img src="assets/avocado.ico" alt="AVoCaDO icon" width="28px"> AVoCaDO: An <u>A</u>udio<u>V</u>isual Vide<u>o</u> <u>Ca</u>ptioner <u>D</u>riven by Temporal <u>O</u>rchestration
2
+
3
+ <p align="left">
4
+ <a href="https://avocado-captioner.github.io/"><img src="https://img.shields.io/badge/Project%20webpage-558b2f?style=for-the-badge"></a>
5
+ <a href="https://huggingface.co/AVoCaDO-Captioner/AVoCaDO"><img src="https://img.shields.io/badge/Model-db8905?style=for-the-badge"></a>
6
+ <a href="https://arxiv.org/abs/2510.10395"><img src="https://img.shields.io/badge/arXiv-red?style=for-the-badge"></a>
7
+ </p>
8
+
9
  ---
10
 
11
+ ## Overview
12
+ Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. We introduce <b>AVoCaDO</b>, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance under visual-only settings.
13
 
14
+ ## 🎬 Captioning Case of AVoCaDO
15
+ <img src="assets/case_2.png" alt="AVoCaDO caption">
16
+ An illustration of a video caption generated by AVoCaDO, featuring both <b>precise audiovisual temporal alignment</b> and <u>accurate dialogue rendering</u>.
17
 
18
+ ## 🚀 Getting Started
19
+ Follow these simple steps to set up and run AVoCaDO on your machine.
20
 
21
+ ### 1. Clone the repository
22
+ First, clone the project and navigate into the directory:
 
 
 
 
 
 
 
23
 
24
+ ```bash
25
+ git clone https://github.com/AVoCaDO-Captioner/AVoCaDO.git
26
+ cd AVoCaDO
27
+ ```
28
 
29
+ ### 2. Set Up the Environment
30
+ Create and activate the Conda environment using the provided ``environment.yml`` file.
31
+
32
+ ```bash
33
+ conda env create -f environment.yml
34
+ conda activate AVoCaDO
35
  ```
36
+
37
+ ### 3. Quick Usage
38
+ ```bash
39
+ python inference.py assets/case_1.mp4
40
  ```
41
 
42
+ ## 📈 Benchmark Evaluation
43
+ We provide evaluation scripts for all evaluated benchmarks in our paper.
44
+
45
+ ### Direct Audiovisual Caption Evaluation
46
+ 1. **video-SALMONN2-testset:**
47
+ ```bash
48
+ bash eval_scripts/video-SALMONN2-testset/eval_video-SALMONN2-test.sh <your_save_directory>
49
+ ```
50
 
51
+ 2. **UGC-VideoCap:**
52
+ ```bash
53
+ bash eval_scripts/UGC-VideoCap/eval_UGC-VideoCap.sh <your_save_directory>
54
+ ```
55
 
56
+ ### QA-based Audiovisual Caption Evaluation
57
+ 1. **Daily-Omni:**
58
+ ```bash
59
+ bash eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh <your_save_directory>
60
+ ```
61
 
62
+ 2. **WorldSense:**
63
+ ```bash
64
+ bash eval_scripts/WorldSense/WorldSense_pipeline.sh <your_save_directory>
65
+ ```
66
 
67
+ ### Visual-only Caption Evaluation
68
+ 1. **VDC:**
69
+ First, generate captions for the videos in the VDC benchmark.
70
+ ```bash
71
+ python eval_scripts/VDC/generate_caption.py \
72
+ --model_path <path_to_AVoCaDO> \
73
+ --fout_path <your_save_path>
74
+ ```
75
 
76
+ Next, set up the judge server. This requires installing [SGLang](https://github.com/sgl-project/sglang) to deploy the [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) as the judge model.
77
+ ```bash
78
+ # Deploy the judge model using SGLang
79
+ python -m sglang.launch_server \
80
+ --model-path path_to_Meta-Llama-3.1-8B-Instruct \
81
+ --port 30000 \
82
+ --dp 2 --tp 4
83
+ ```
84
+
85
+ Once the judge model is successfully deployed and running, you can start the evaluation.
86
+ ```bash
87
+ bash AVoCaDO/eval_scripts/VDC/evaluation.sh <your_save_path>
88
+ ```
89
+
90
+ 2. **DREAM-1K:**
91
+ ```bash
92
+ bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh <your_save_directory>
93
+ ```
94
+
95
+
96
+ ## ✒️ Citation
97
+
98
+ If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!
99
+
100
+ ```bibtex
101
+ @article{chen2025avocado,
102
+ title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
103
+ author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
104
+ journal={arXiv preprint arXiv:2510.10395},
105
+ year={2025}
106
+ }
107
+ ```
assets/avocado.ico ADDED
assets/case_1.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:78c17b195eae977e5ffdeafb499fbeeec2ea50f9258973154c0da111c2b90b07
3
+ size 6417271
assets/case_2.png ADDED

Git LFS Details

  • SHA256: 46159b25d4560ca19ab7cbe605de1762e71ce1fc3c4f2ed72321af1b268fe2bc
  • Pointer size: 132 Bytes
  • Size of remote file: 2.56 MB
environment.yml ADDED
@@ -0,0 +1,327 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: AVoCaDO
2
+ channels:
3
+ - conda-forge
4
+ - defaults
5
+ dependencies:
6
+ - _libgcc_mutex=0.1=main
7
+ - _openmp_mutex=5.1=1_gnu
8
+ - bzip2=1.0.8=h5eee18b_6
9
+ - ca-certificates=2024.9.24=h06a4308_0
10
+ - ld_impl_linux-64=2.40=h12ee557_0
11
+ - libffi=3.4.4=h6a678d5_1
12
+ - libgcc-ng=11.2.0=h1234567_1
13
+ - libgomp=11.2.0=h1234567_1
14
+ - libstdcxx-ng=11.2.0=h1234567_1
15
+ - libuuid=1.41.5=h5eee18b_0
16
+ - mscorefonts=0.0.1=3
17
+ - ncurses=6.4=h6a678d5_0
18
+ - openssl=3.0.15=h5eee18b_0
19
+ - pip=24.2=py310h06a4308_0
20
+ - python=3.10.15=he870216_1
21
+ - readline=8.2=h5eee18b_0
22
+ - setuptools=75.1.0=py310h06a4308_0
23
+ - sqlite=3.45.3=h5eee18b_0
24
+ - tk=8.6.14=h39e8969_0
25
+ - wheel=0.44.0=py310h06a4308_0
26
+ - xz=5.4.6=h5eee18b_1
27
+ - zlib=1.2.13=h5eee18b_1
28
+ - pip:
29
+ - accelerate==0.34.1
30
+ - aiodns==3.2.0
31
+ - aiohappyeyeballs==2.4.3
32
+ - aiohttp==3.10.10
33
+ - aiosignal==1.3.1
34
+ - annotated-types==0.7.0
35
+ - antlr4-python3-runtime==4.11.0
36
+ - anyio==4.6.2.post1
37
+ - apscheduler==3.10.4
38
+ - asttokens==2.4.1
39
+ - async-timeout==4.0.3
40
+ - attrs==24.2.0
41
+ - audioread==3.0.1
42
+ - av==13.1.0
43
+ - backcall==0.2.0
44
+ - beartype==0.19.0
45
+ - beautifulsoup4==4.13.4
46
+ - bleach==6.2.0
47
+ - blinker==1.9.0
48
+ - cachetools==4.2.4
49
+ - cchardet==2.1.7
50
+ - certifi==2021.10.8
51
+ - cffi==1.17.1
52
+ - charset-normalizer==3.4.0
53
+ - cityhash==0.2.4.post11
54
+ - click==8.1.7
55
+ - cloudpickle==3.1.0
56
+ - colorama==0.4.6
57
+ - compressed-tensors==0.8.0
58
+ - contourpy==1.3.1
59
+ - cycler==0.12.1
60
+ - dapr==1.14.0
61
+ - dapr-ext-fastapi==1.14.0
62
+ - dask==2024.12.1
63
+ - datasets==3.1.0
64
+ - dbpool==1.2.1
65
+ - decorator==4.4.2
66
+ - decord==0.6.0
67
+ - deepspeed==0.16.2
68
+ - defusedxml==0.7.1
69
+ - dill==0.4.0
70
+ - diskcache==5.6.3
71
+ - distro==1.9.0
72
+ - docopt==0.6.2
73
+ - docstring-parser==0.16
74
+ - dsc-auth==0.1.18
75
+ - einops==0.8.0
76
+ - et-xmlfile==2.0.0
77
+ - exceptiongroup==1.2.2
78
+ - executing==2.1.0
79
+ - fastapi==0.115.5
80
+ - fastjsonschema==2.21.1
81
+ - ffmpeg-python==0.2.0
82
+ - filelock==3.13.1
83
+ - fire==0.7.0
84
+ - flash-attn==2.7.0.post2
85
+ - flask==3.1.0
86
+ - fonttools==4.55.0
87
+ - frozenlist==1.5.0
88
+ - fsspec==2024.2.0
89
+ - ftfy==6.3.1
90
+ - func-timeout==4.3.5
91
+ - future==1.0.0
92
+ - fvcore==0.1.5.post20221221
93
+ - gguf==0.10.0
94
+ - google-api-core==2.23.0
95
+ - google-auth==2.36.0
96
+ - google-cloud-aiplatform==1.71.1
97
+ - google-cloud-bigquery==3.27.0
98
+ - google-cloud-core==2.4.1
99
+ - google-cloud-resource-manager==1.13.1
100
+ - google-cloud-storage==2.18.2
101
+ - google-crc32c==1.6.0
102
+ - google-resumable-media==2.7.2
103
+ - googleapis-common-protos==1.66.0
104
+ - grpc-google-iam-v1==0.13.1
105
+ - grpcio==1.68.0
106
+ - grpcio-reflection==1.48.2
107
+ - grpcio-status==1.68.0
108
+ - h11==0.14.0
109
+ - h5py==3.12.1
110
+ - hf-xet==1.1.4
111
+ - hickle==5.0.3
112
+ - hiredis==2.4.0
113
+ - hjson==3.1.0
114
+ - httpcore==1.0.6
115
+ - httptools==0.6.4
116
+ - httpx==0.27.2
117
+ - huggingface-hub==0.33.0
118
+ - humanize==4.11.0
119
+ - icecream==2.1.3
120
+ - idna==3.10
121
+ - imageio==2.36.0
122
+ - imageio-ffmpeg==0.5.1
123
+ - importlib-metadata==8.5.0
124
+ - infra-component==1.4.7
125
+ - infra-framework==1.17.10
126
+ - infra-kconf==1.1.3
127
+ - infra-kess==1.1.5
128
+ - infra-keycenter==1.1.1
129
+ - infra-storage==1.3.1
130
+ - install==1.3.5
131
+ - interegular==0.3.3
132
+ - iopath==0.1.10
133
+ - ipdb==0.13.13
134
+ - ipython==8.12.3
135
+ - itsdangerous==2.2.0
136
+ - jedi==0.19.2
137
+ - jinja2==3.1.3
138
+ - jiter==0.7.0
139
+ - joblib==1.4.2
140
+ - jsonschema==4.23.0
141
+ - jsonschema-specifications==2024.10.1
142
+ - jupyter-client==8.6.3
143
+ - jupyter-core==5.8.1
144
+ - jupyterlab-pygments==0.3.0
145
+ - kazoo==2.10.0
146
+ - kiwisolver==1.4.7
147
+ - ks-kafka-python==2.0.3
148
+ - lark==1.2.2
149
+ - lazy-loader==0.4
150
+ - levenshtein==0.26.1
151
+ - librosa==0.11.0
152
+ - llvmlite==0.43.0
153
+ - lm-format-enforcer==0.10.6
154
+ - locket==1.0.0
155
+ - lxml==4.9.4
156
+ - lz4==3.1.10
157
+ - markupsafe==2.1.5
158
+ - matplotlib==3.10.0
159
+ - matplotlib-inline==0.1.7
160
+ - mergedeep==1.3.4
161
+ - mistral-common==1.5.1
162
+ - mistral-inference==1.5.0
163
+ - mistune==3.1.3
164
+ - moviepy==1.0.3
165
+ - mpmath==1.3.0
166
+ - msgpack==1.1.0
167
+ - msgpack-numpy==0.4.8
168
+ - msgspec==0.18.6
169
+ - multidict==6.1.0
170
+ - multiprocess==0.70.18
171
+ - mysql-connector-python==8.0.31
172
+ - nbclient==0.10.2
173
+ - nbconvert==7.16.6
174
+ - nbformat==5.10.4
175
+ - nest-asyncio==1.6.0
176
+ - networkx==3.2.1
177
+ - ninja==1.11.1.3
178
+ - numba==0.60.0
179
+ - numpy==1.26.3
180
+ - nvidia-cublas-cu12==12.4.5.8
181
+ - nvidia-cuda-cupti-cu12==12.4.127
182
+ - nvidia-cuda-nvrtc-cu12==12.4.127
183
+ - nvidia-cuda-runtime-cu12==12.4.127
184
+ - nvidia-cudnn-cu12==9.1.0.70
185
+ - nvidia-cufft-cu12==11.2.1.3
186
+ - nvidia-curand-cu12==10.3.5.147
187
+ - nvidia-cusolver-cu12==11.6.1.9
188
+ - nvidia-cusparse-cu12==12.3.1.170
189
+ - nvidia-cusparselt-cu12==0.6.2
190
+ - nvidia-ml-py==12.560.30
191
+ - nvidia-nccl-cu12==2.21.5
192
+ - nvidia-nvjitlink-cu12==12.4.127
193
+ - nvidia-nvtx-cu12==12.4.127
194
+ - nvitop==1.3.2
195
+ - omegaconf==2.3.0
196
+ - open-clip-torch==2.29.0
197
+ - openai==1.54.3
198
+ - opencv-python==4.10.0.84
199
+ - opencv-python-headless==4.10.0.84
200
+ - openpyxl==3.1.5
201
+ - outlines==0.0.46
202
+ - packaging==24.1
203
+ - pandas==2.2.3
204
+ - pandocfilters==1.5.1
205
+ - parameterized==0.9.0
206
+ - parso==0.8.4
207
+ - partd==1.4.2
208
+ - partial-json-parser==0.2.1.1.post4
209
+ - pathos==0.3.4
210
+ - peft==0.14.0
211
+ - pexpect==4.9.0
212
+ - pickleshare==0.7.5
213
+ - pillow==10.4.0
214
+ - pipreqs==0.5.0
215
+ - platformdirs==4.3.6
216
+ - pooch==1.8.2
217
+ - portalocker==3.0.0
218
+ - pox==0.3.6
219
+ - ppft==1.7.7
220
+ - prettytable==2.5.0
221
+ - proglog==0.1.10
222
+ - prometheus-client==0.21.0
223
+ - prometheus-fastapi-instrumentator==7.0.0
224
+ - prompt-toolkit==3.0.48
225
+ - propcache==0.2.0
226
+ - proto-plus==1.25.0
227
+ - protobuf==3.20.0
228
+ - psutil==6.1.0
229
+ - ptyprocess==0.7.0
230
+ - pure-eval==0.2.3
231
+ - py-cpuinfo==9.0.0
232
+ - pyairports==2.1.1
233
+ - pyarrow==18.0.0
234
+ - pyasn1==0.6.1
235
+ - pyasn1-modules==0.4.1
236
+ - pycares==4.4.0
237
+ - pycocoevalcap==1.2
238
+ - pycocotools==2.0.8
239
+ - pycountry==24.6.1
240
+ - pycparser==2.22
241
+ - pycryptodome==3.21.0
242
+ - pydantic==2.9.2
243
+ - pydantic-core==2.23.4
244
+ - pydub==0.25.1
245
+ - pygments==2.18.0
246
+ - pyparsing==3.2.0
247
+ - pysmhasher==0.2.5
248
+ - pysoundfile==0.9.0.post1
249
+ - python-dateutil==2.9.0.post0
250
+ - python-dotenv==1.0.1
251
+ - python-levenshtein==0.26.1
252
+ - python-snappy==0.6.1
253
+ - pytorchvideo==0.1.5
254
+ - pytube==15.0.0
255
+ - pytz==2021.3
256
+ - pytz-deprecation-shim==0.1.0.post0
257
+ - pyyaml==6.0.2
258
+ - pyzmq==26.2.0
259
+ - qwen-omni-utils==0.0.8
260
+ - qwen-vl-utils==0.0.8
261
+ - rapidfuzz==3.12.2
262
+ - ray==2.38.0
263
+ - redis==4.6.0
264
+ - referencing==0.35.1
265
+ - regex==2024.9.11
266
+ - requests==2.32.3
267
+ - rpds-py==0.21.0
268
+ - rsa==4.9
269
+ - safetensors==0.4.5
270
+ - scenedetect==0.6.4
271
+ - scikit-learn==1.6.0
272
+ - scipy==1.14.1
273
+ - seaborn==0.13.2
274
+ - sentencepiece==0.2.0
275
+ - setuptools-scm==8.1.0
276
+ - shapely==2.0.6
277
+ - simple-parsing==0.1.6
278
+ - six==1.16.0
279
+ - sniffio==1.3.1
280
+ - soundfile==0.13.1
281
+ - soupsieve==2.7
282
+ - soxr==0.5.0.post1
283
+ - sqlparse==0.4.4
284
+ - stack-data==0.6.3
285
+ - starlette==0.41.3
286
+ - sympy==1.13.1
287
+ - tabulate==0.9.0
288
+ - termcolor==2.5.0
289
+ - threadpoolctl==3.5.0
290
+ - tiktoken==0.7.0
291
+ - timm==1.0.12
292
+ - tinycss2==1.4.0
293
+ - tokenizers==0.21.1
294
+ - tomli==2.0.2
295
+ - toolz==1.0.0
296
+ - torch==2.6.0
297
+ - torchvision==0.21.0
298
+ - tornado==6.4.1
299
+ - tqdm==4.67.0
300
+ - traitlets==5.14.3
301
+ - transformers==4.52.3
302
+ - triton==3.2.0
303
+ - typeguard==4.4.1
304
+ - typing-extensions==4.12.2
305
+ - tzdata==2024.2
306
+ - tzlocal==4.3.1
307
+ - unpaddedbase64==2.1.0
308
+ - urllib3==1.26.20
309
+ - uvicorn==0.32.0
310
+ - uvloop==0.21.0
311
+ - vertexai==1.71.1
312
+ - vllm==0.6.3
313
+ - watchfiles==0.24.0
314
+ - wcwidth==0.2.13
315
+ - webencodings==0.5.1
316
+ - websockets==13.1
317
+ - werkzeug==3.1.3
318
+ - word2number==1.1
319
+ - xformers==0.0.27.post2
320
+ - xmltodict==0.12.0
321
+ - xxhash==3.5.0
322
+ - yacs==0.1.8
323
+ - yarg==0.1.9
324
+ - yarl==1.17.1
325
+ - zhon==2.1.1
326
+ - zipp==3.20.2
327
+ - zsvision==0.7.12
eval_scripts/DREAM-1K/dream_example.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
eval_scripts/DREAM-1K/eval_DREAM-1K.sh ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Evaluate AVoCaDO on DREAM-1K: generate captions for every benchmark video,
# then score them with the Tarsier evaluation pipeline.
#
# Usage: bash eval_scripts/DREAM-1K/eval_DREAM-1K.sh <output_dir>
#
# Fail fast: abort on any command error, unset variable, or pipeline failure
# so a broken caption run never feeds the evaluation step.
set -euo pipefail

MODEL_PATH="path_to_AVoCaDO" # TODO: point at the AVoCaDO checkpoint

# Validate the required output-directory argument before touching the filesystem.
if [ $# -lt 1 ]; then
    echo "Usage: $0 <output_dir>" >&2
    exit 1
fi
OUTPUT_DIR="$1"

mkdir -p "$OUTPUT_DIR"

# Step 1: caption generation (writes one JSONL record per video).
python eval_scripts/DREAM-1K/generate_caption.py \
    --model_path "$MODEL_PATH" \
    --save_path "$OUTPUT_DIR/model_caption.jsonl"

# Step 2: score the generated captions.
bash eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh "$OUTPUT_DIR/model_caption.jsonl"
eval_scripts/DREAM-1K/generate_caption.py ADDED
@@ -0,0 +1,91 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
# Generate DREAM-1K video captions with a Qwen2.5-Omni checkpoint and stream
# them out as JSONL (one updated record per input example).
import os
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
import argparse
import json
from tqdm import tqdm
from pathlib import Path

VIDEO_MAX_PIXELS = 401408  # 512*28*28 — per-frame pixel budget, passed per message
VIDEO_TOTAL_PIXELS = 20070400  # 512*28*28*50 — budget across all sampled frames
USE_AUDIO_IN_VIDEO = False  # DREAM-1K is evaluated visual-only here
# NOTE(review): the *total* pixel budget is exported under the 'VIDEO_MAX_PIXELS'
# env key — confirm whether 'VIDEO_TOTAL_PIXELS' was the intended key.
os.environ['VIDEO_MAX_PIXELS'] = str(VIDEO_TOTAL_PIXELS)
script_dir = Path(__file__).resolve().parent
example_path = script_dir / "dream_example.jsonl"  # benchmark records, one JSON per line
video_dir = "" # TODO — directory holding the DREAM-1K video files

parser = argparse.ArgumentParser(description="Evaluate a model and save results.")
parser.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
parser.add_argument("--save_path", type=str, required=True, help="Path to save the evaluation results.")
args = parser.parse_args()

model_path = args.model_path
fout_path = args.save_path

# Both files stay open for the whole run; fout is flushed after every sample
# by the main loop below.
f_example = open(example_path, 'r', encoding='utf-8')
fout = open(fout_path, 'w', encoding='utf-8')

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.disable_talker()  # text-only generation; the speech ("talker") head is unused
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
38
def chat(data):
    """Caption one video with the globally loaded Qwen2.5-Omni model.

    ``data`` supplies ``"video_path"`` and ``"question"``; the return value is
    the decoded model response with the chat-template preamble stripped off.
    """
    system_turn = {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    }
    user_turn = {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": data["video_path"],
                "max_pixels": VIDEO_MAX_PIXELS,
            },
            {
                "type": "text",
                "text": data["question"],
            },
        ],
    }
    conversation = [system_turn, user_turn]

    # Render the prompt and extract the multimodal payloads for the processor.
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    model_inputs = processor(text=prompt, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    model_inputs = model_inputs.to(model.device).to(model.dtype)

    # Greedy decoding; only the thinker (text) head generates.
    generated_ids = model.generate(**model_inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, do_sample=False, thinker_max_new_tokens=2048)

    decoded = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # Everything after the final "\nassistant\n" marker is the model's reply.
    return decoded.split("\nassistant\n")[-1]
73
+
74
+
75
# Iterate the DREAM-1K examples, caption each video, and stream the updated
# records to the output JSONL as they are produced.
for idx, line in tqdm(enumerate(f_example, start=1)):
    data = json.loads(line)
    video_path = os.path.join(video_dir, data["messages"][0]["content"][0]["video"]["video_file"])
    question = "Imagine the video from these frames and describe it in detail."

    temp_data = {
        "video_path": video_path,
        "question": question,
    }
    # inference_mode disables autograd bookkeeping for the generation call.
    with torch.inference_mode():
        response = chat(temp_data)

    # Record the prompt actually used and the model's caption directly on the
    # parsed record. (The original bound `out_data = data` without copying, so
    # both names aliased the same dict; writing through one name makes the
    # in-place mutation explicit.)
    data["messages"][0]["content"][1]["text"] = question
    data["messages"][1]["content"][0]["text"] = response
    fout.write(json.dumps(data, ensure_ascii=False) + '\n')
    fout.flush()  # keep partial results on disk if the run is interrupted
eval_scripts/DREAM-1K/tarsier/LICENSE ADDED
@@ -0,0 +1,201 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
eval_scripts/DREAM-1K/tarsier/configs/tarser2_default_config.yaml ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ max_n_frames: 256
2
+ n_frames: 16
3
+ max_pixels: 460800 # 1280 * 720 // 2
4
+ min_pixels: 0
5
+ max_seq_len: 16384
6
+ is_training: false # affects: 1. frame sampling differs between training and testing; 2. response is ignored at test time.
7
+ print_data_error: true
8
+ # is_training: false  # NOTE: duplicate key commented out — duplicate mapping keys are invalid YAML and the later value silently overrides the one defined above.
9
+ do_image_padding: false
10
+ do_image_crop: false
11
+ do_image_resize: false
12
+ video_sampling_strategy: {'video_sampler_version': 'v1', 'force_frames_n_divisible': 1, 'use_multi_images_for_video': true}
13
+ prompt: ""
14
+ train_task: sft
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/multi_images_parser.py ADDED
@@ -0,0 +1,199 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Dict, List
2
+ import random
3
+ import re
4
+ from PIL import Image
5
+
6
+ from .utils import sample_video, read_image
7
+
8
class MultiImagesParser:
    """Parse a multi-image sample (several captioned images, possibly frames
    extracted from a video) into chat-style ``messages``.

    Depending on ``dataset``, the sample is rendered either as a caption task
    ("describe the images ... in order") or as a per-frame QA task.
    """

    def __init__(
        self,
        n_frames=8,
        is_training=True,
    ):
        # n_frames: int -> sample 1..n_frames images per example;
        #           sequence -> choose one of the listed counts at random.
        self.n_frames = n_frames
        # NOTE(review): is_training is stored but never read in this class —
        # kept for API symmetry with the other parsers.
        self.is_training = is_training
        # fmt: off
        # Reference example of the expected input schema (documentation only).
        self.data_temp = {
            "text": [
                [{
                    "prompt": "Describe the image in short.",
                    "response": "A rollerblader rides high in a full pipe while others watch"
                }],
                [{
                    "prompt": "Describe the image in short.",
                    "response": "A woman in winter clothes is on the sidewalk with a phone."
                }]
            ],
            "image": [
                {
                    "image_file": "/mnt/bn/videonaslq/images/flickr30k/images/3371533654.jpg"
                },
                {
                    "image_file": "/mnt/bn/videonaslq/images/coco/train2014/COCO_train2014_000000177950.jpg"
                },
                {
                    "video_file": "/mnt/bn/llmdatalq/jiangnan/video_generation/webvid_10M_download/20230609/videos/011851_011900/1047443473.mp4",
                    "frame_indices": [0, 85, 171, 256, 342, 427, 513, 598]
                }
            ],
            "dataset": "coco",
            "task": "multi_images",
            "image_processing_config": {},
        }
        # fmt: on

    def check_format(self, data_dict: Dict, image_processing_config: Dict):
        """Validate dataset name and reject samples carrying coordinates."""
        assert data_dict['dataset'] in ['coco', 'sharegpt4v_cap100k', 'sharegpt4v_mix665k', 'webvid', 'movie'], data_dict

        # Multi-image data is currently assumed to contain no coordinate annotations.
        if image_processing_config.get('has_coordinates', False):
            raise ValueError(f'do_crop and has_coordinates cannot be True at the same time in MultiImagesParser!')

        # Heuristic scan for coordinate-like patterns (e.g. "[0.1, 0.2, 0.3, 0.4]") left in the text.
        texts = data_dict['text']
        for text in texts:
            # NOTE(review): per data_temp each element of `text` may itself be a
            # *list* of prompt/response dicts; this indexing assumes a dict —
            # confirm the upstream format before relying on this check.
            match = re.search(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', text['prompt'] + text['response'])
            if match:
                print(f'[Warning] 疑似检测到包含坐标的数据:{data_dict}')

    def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
        """Build the user/assistant message pair for one multi-image sample."""
        self.check_format(data_dict, image_processing_config)

        # Shuffle texts and images with the same permutation so pairs stay aligned.
        texts = data_dict['text']
        images = data_dict['image']
        images = self.load_images(images)
        idxs = list(range(len(texts)))
        random.shuffle(idxs)
        texts = [texts[i] for i in idxs]
        # NOTE(review): load_images may yield more images than there are texts
        # (a video_file item expands to several frames); extra images are dropped
        # here — confirm this is intended.
        images = [images[i] for i in idxs]

        # Randomly choose how many (text, image) pairs to keep.
        if isinstance(self.n_frames, int):
            n_frames = random.choice(list(range(1, self.n_frames + 1)))
        else:
            n_frames = random.choice(self.n_frames)
        texts = texts[: n_frames]
        images = images[: n_frames]

        dataset = data_dict['dataset']
        if dataset in ['coco', 'sharegpt4v_cap100k', 'webvid', 'movie']:
            prompt, response = self.transform_for_caption_task(texts, dataset, images)
        else:
            prompt, response = self.transform_for_qa_task(texts, dataset, images)

        messages = [
            {
                "role": "user",
                "content": [
                    *[{"type": "image", "image": img} for img in images],
                    {"type": "text", "text": prompt},
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": response}
                ]
            }
        ]

        return messages

    def transform_for_caption_task(self, texts, dataset, images):
        """Render a captioning prompt starting from a random frame index;
        the response concatenates the captions from that frame onward."""
        idx = random.choice(list(range(len(texts))))

        # Prompt wording varies by dataset (short vs. detailed captions).
        if dataset == 'coco':
            if len(texts) == 1:
                prompt = 'Describe the image in short.'
            else:
                prompt = f'Describe the images starting from frame {idx + 1} in short in order.'
        elif dataset == 'sharegpt4v_cap100k':
            if len(texts) == 1:
                prompt = 'Describe the image in detail.'
            else:
                prompt = f'Describe the images starting from frame {idx + 1} in detail in order.'
        else:
            if len(texts) == 1:
                prompt = 'Describe the image.'
            else:
                prompt = f'Describe the images starting from frame {idx + 1} in order.'
        response = ''
        for i, text in enumerate(texts):
            if i < idx:
                continue  # captions before the chosen start frame are skipped
            if not isinstance(text, dict):
                # A text entry may be a list of alternative captions; pick one.
                text = random.choice(text)
            resp = text['response']
            response += f'{resp}\n'
        return prompt, response

    def transform_for_qa_task(self, texts, dataset, images):
        """Render per-frame question/answer pairs (frame-tagged when >1 frame)."""
        prompt, response = '', ''
        for i, text in enumerate(texts):
            if not isinstance(text, dict):
                # A text entry may be a list of alternative QA pairs; pick one.
                text = random.choice(text)
            if len(texts) > 1:
                prompt += f'Question for frame {i+1}:\n' + text['prompt'] + '\n'
                response += f'Answer to question of frame {i+1}:\n' + text['response'] + '\n'
            else:
                prompt += text['prompt'] + '\n'
                response += text['response'] + '\n'
        return prompt, response


    def load_images(self, image_items: List[Dict]) -> List[Image.Image]:
        """
        Load all referenced images into PIL images.

        image_items: List[Dict]. each item like:
            {"video_file": "path/to/video", "frame_indices": [1]}
        or
            {"image_file": "path/to/image"}
        """
        if image_items is None:
            raise ValueError(f'image_items is None!')

        if isinstance(image_items, dict):
            image_items = [image_items]

        images = []

        for image_item in image_items:

            if 'video_file' in image_item:
                file_key = 'video_file'
            elif 'image_file' in image_item:
                file_key = 'image_file'
            else:
                raise KeyError(f'video_file or image_file not in {image_item}')

            file_path = image_item[file_key]
            if file_key == 'video_file':
                # Video item: decode the listed frames as individual images.
                frame_indices = image_item.get('frame_indices', None)
                if frame_indices is None:
                    raise ValueError(f'read 0 frame: {image_item}')
                if isinstance(frame_indices, int):
                    frame_indices = [frame_indices]
                frames = sample_video(file_path, frame_indices = frame_indices)
                images.extend(frames)
            else:
                # Image item: a single path or a list of paths.
                if isinstance(file_path, str):
                    file_path = [file_path]
                images.extend([read_image(f) for f in file_path])

        return images
+
187
+ if __name__ == '__main__':
188
+ # python3 -m xenon_generation.data.custom_data_parsers.multi_images_parser
189
+
190
+ from tqdm import tqdm
191
+ from tools.rw_utils import read_jsonlines
192
+
193
+ lines = read_jsonlines('/mnt/bn/videonaslq/VideoCaption/datasets_1009/sharegpt4v_cap100k/part_36.jsonl')
194
+ lines = lines[:10]
195
+ parser = MultiImagesParser(n_frames=8)
196
+ for i, l in tqdm(enumerate(lines)):
197
+ l_image_processing_config = l.get('image_processing_config', {})
198
+ messages = parser.transform(l, l_image_processing_config)
199
+ print(messages)
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/object_tracking_parser.py ADDED
@@ -0,0 +1,160 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Dict
2
+ import random
3
+ import re
4
+
5
+ from torchvision import transforms
6
+
7
+ from .utils import sample_video
8
+
9
def return_same(x):
    """Identity transform — hand *x* back unchanged (no-op image transform)."""
    return x
11
+
12
+ def _bbox_transform_for_padding(bbox, frame):
13
+ w1, h1, w2, h2 = bbox
14
+ width, height = frame.size
15
+ if width == height:
16
+ pass
17
+ elif width > height:
18
+ h1 += (width - height) // 2
19
+ h2 += (width - height) // 2
20
+ height = width
21
+ else:
22
+ w1 += (height - width) // 2
23
+ w2 += (height - width) // 2
24
+ width = height
25
+ new_bbox = [w1 / width, h1 / height, w2 / width, h2 / height]
26
+ new_bbox = [round(i, 2) for i in new_bbox]
27
+ return new_bbox
28
+
29
+ def _bbox_transform_for_resize(bbox, frame):
30
+ w1, h1, w2, h2 = bbox
31
+ width, height = frame.size
32
+ new_bbox = [w1 / width, h1 / height, w2 / width, h2 / height]
33
+ new_bbox = [round(i, 2) for i in new_bbox]
34
+ return new_bbox
35
+
36
class InAndOutCropAndResize(object):
    """Crop and resize for in_and_out boxes data according to yuchen.

    Crops a horizontally-centered window spanning 75% of the frame height
    (rows 12.5%..87.5%, columns width/2 ± 0.375*height), then resizes it
    to the target size.

    Args:
        size: tuple of (width, height)
    """

    def __init__(self, size):
        self.size = size

    def __call__(self, img):
        """
        Args:
            img (PIL Image): PIL Image
        Returns:
            PIL Image: PIL image.
        """
        width, height = img.width, img.height
        left = int(width * 0.5 - height * 0.375)
        top = int(height * 0.125)
        right = int(width * 0.5 + height * 0.375)
        bottom = int(height * 0.875)
        cropped = img.crop((left, top, right, bottom))
        return cropped.resize(self.size)
60
+
61
+
62
class ObjectTrackingParser:
    """Parse object-tracking samples into a chat-style task: given the first-frame
    bounding boxes of up to ``max_objects`` objects, the model must predict the
    boxes in the remaining frames."""

    def __init__(
        self,
        n_frames = 8,       # int, or a sequence of candidate frame counts to sample from
        max_objects = 3,    # at most this many objects are included per sample
        is_training=True,   # NOTE(review): stored but not read in this class
    ):
        self.n_frames = n_frames
        self.max_objects = max_objects
        self.is_training = is_training
        # Per-dataset image transforms (see get_img_transform).
        self.img_transform = self.get_img_transform()
        # fmt: off
        # Reference example of the expected input schema (documentation only).
        self.data_temp = {
            "video_file": "/mnt/bn/llmdatalq/jiaxin/hdvila/20230926/saved/saved_video_clips/0076/lOjn__YCec4.624.1104.mp4",
            "frame_indices": [154, 157, 160, 163, 166, 169, 172, 175, 178, 181, 184, 187, 190, 193, 196, 199, 202],
            "objects": {
                "0": {
                    "phrase": "person",
                    "all_frame_bounding_boxes": [[2, 0, 255, 250], [17, 0, 255, 251], [35, 0, 255, 253], [44, 0, 255, 255], [52, 0, 255, 255], [54, 0, 255, 255], [63, 0, 255, 255], [60, 0, 255, 255], [54, 0, 253, 255], [43, 0, 250, 255], [36, 1, 249, 255], [36, 0, 252, 254], [41, 0, 252, 254], [61, 0, 255, 253], [68, 4, 255, 255], [74, 8, 255, 255], [91, 3, 255, 255]]
                }
            },
            "task": "object_tracking",
            "dataset": "hdvila"
        }
        # fmt: on

    def check_format(self, data_dict: Dict, image_processing_config: Dict):
        # Box-tracking data does not support do_crop: cropping would invalidate the boxes.
        if image_processing_config.get('do_crop', False):
            raise ValueError(f'do_crop is not supported in ObjectTrackingParser!')

    def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
        """Build the tracking prompt/response messages for one sample."""
        self.check_format(data_dict, image_processing_config)

        # Box normalization must match how the images will be processed downstream.
        bbox_transform = _bbox_transform_for_padding if image_processing_config['do_padding'] else _bbox_transform_for_resize

        # Randomly sub-sample up to n_frames annotated frames (order preserved).
        if isinstance(self.n_frames, int):
            n_frames = self.n_frames
        else:
            n_frames = random.choice(self.n_frames)
        total_frames = list(range(len(data_dict['frame_indices'])))
        idxs = random.sample(total_frames, min(n_frames, len(total_frames)))
        idxs.sort()

        frame_indices = [data_dict['frame_indices'][i] for i in idxs]
        frames = sample_video(data_dict['video_file'], frame_indices=frame_indices)
        img_transform = self.img_transform[data_dict['dataset']]
        frames = [img_transform(f) for f in frames]

        # Collect up to max_objects tracks, normalizing each box per frame.
        # NOTE(review): boxes are normalized against the *transformed* frame sizes,
        # which assumes the source annotations already match that geometry — confirm.
        objects = []
        for _, o in data_dict['objects'].items():
            if o is None:
                continue
            all_frame_bounding_boxes = [o['all_frame_bounding_boxes'][i] for i in idxs]
            all_frame_bounding_boxes_t = []
            for bbox, frame in zip(all_frame_bounding_boxes, frames):
                all_frame_bounding_boxes_t.append(bbox_transform(bbox, frame))
            objects.append(all_frame_bounding_boxes_t)
            if len(objects) >= self.max_objects:
                break

        prompt = "Given the bounding box coordinates of these objects in the first frame, output the bounding box coordinates in the following frames.\n{}"
        response = ''

        # First-frame boxes go into the prompt; boxes of later frames form the answer.
        object_info = ''
        for i, o in enumerate(objects):
            object_info += f'object {i+1}: {o[0]}\n'
            response += f'object {i+1}: {o[1:]}\n'
        response = response.strip()
        prompt = prompt.format(object_info)

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": frames},
                    {"type": "text", "text": prompt}
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": response}
                ]
            }
        ]

        return messages

    def get_img_transform(self):
        # Per-dataset geometry: webvid frames are used as-is; hdvila variants are
        # normalized to 256x256 via resize+center-crop or the in/out crop helper.
        return {
            'webvid': return_same,
            'hdvila': transforms.Compose([
                transforms.Resize(size=256),
                transforms.CenterCrop(size=(256, 256))
            ]),
            'hdvila_in_and_out_boxes': InAndOutCropAndResize(size=(256, 256))
        }
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/standard_vision_parser.py ADDED
@@ -0,0 +1,255 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Dict, List
2
+ from PIL import Image
3
+ import random
4
+
5
+ from .utils import sample_video, read_image, adjust_bbox, filter_ocr_polygon
6
+
7
+
8
class VisionParser:
    """Standard vision parser: replaces image/video file references inside chat
    ``messages`` with decoded PIL images, budgeting the total number of frames
    and post-processing text (bbox adjustment, OCR filtering)."""

    def __init__(
        self,
        n_frames=8,                  # int, or a sequence of candidate totals to sample from
        max_n_frames=256,            # hard upper bound on frames per sample
        is_training=True,            # train/eval frame sampling differs inside sample_video
        video_sampling_strategy={},  # extra options, e.g. force_frames_n_divisible, use_multi_images_for_video
    ):
        self.n_frames = n_frames
        self.max_n_frames = max_n_frames
        self.is_training = is_training
        self.video_sampling_strategy = video_sampling_strategy

        # fmt: off
        # Reference example of the expected input schema (documentation only).
        self.data_temp = {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe the image and the video."},
                        # supported image formats:
                        {"type": "image", "image": {"image_file": "/path/to/image"}},
                        {"type": "image", "image": {"video_file": "/path/to/video", "frame_indices": 0}},
                        # supported video formats:
                        {"type": "video", "video": {"video_file": "/path/to/video"}},
                        {"type": "video", "video": {"video_file": "/path/to/video", "frame_indices": [0, 1, 2]}},
                        {"type": "video", "video": {"video_file": "/path/to/video", "start_frame": 0, "end_frame": 100}},
                        {"type": "video", "video": {"video_file": "/path/to/video", "time_indices": [0, 1, 2]}},
                        {"type": "video", "video": {"video_file": "/path/to/video", "start_time": 0, "end_time": 100}},
                        {"type": "video", "video": {"image_file": ["/path/to/image"]}, "frame_indices": [0, 1, 2]},
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text","text": "xxx"}
                    ]
                }
            ],
            "dataset": "LSMDC",
            "task": "video/caption"
        }
        # fmt: on

    def check_format(self, data_dict: Dict, image_processing_config: Dict):
        if image_processing_config.get('do_crop', False) and image_processing_config.get('has_coordinates', False):
            raise ValueError(f'do_crop and has_coordinates cannot be True at the same time!')

    def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
        """Transform one sample in place and return its ``messages``.

        1. Replaces each image/video reference in the messages with the loaded
           PIL.Image / List[PIL.Image].
        2. Text special-casing: adjusts bounding boxes for padding; filters
           OCR entries whose area is too small.
        """
        self.check_format(data_dict, image_processing_config)

        self.set_n_frames(data_dict)

        # ugly! Only image tasks need bbox adjustment / OCR filtering, and they
        # are normalized against the first image encountered.
        first_image = None

        for msg in data_dict['messages']:
            if isinstance(msg['content'], dict):
                msg['content'] = [msg['content']]
            for content in msg['content']:

                if content['type'] == 'image':
                    content['image'] = self.load_image_item(content['image'])
                    if first_image is None:
                        first_image = content['image']
                elif content['type'] == 'video':
                    video = self.load_video_item(content['video'])
                    content['video'] = video.pop('frames')
                    if video:
                        # NOTE(review): assumes data_dict already has an 'extra_info'
                        # dict — would raise KeyError otherwise; confirm upstream.
                        data_dict['extra_info']['frame_disturb_info'] = video.pop('video_info', {})
                elif content['type'] == 'text':
                    pass
                else:
                    raise ValueError(f"content['type']={content['type']} MUST be one of ['image', 'video', 'text']")
        # Second pass: text post-processing once first_image is known.
        for msg in data_dict['messages']:
            for content in msg['content']:
                if content['type'] == 'text':
                    self.postprocess_text(content, data_dict, image_processing_config, first_image)

        return data_dict['messages']

    # set n_frames for each vision item.
    def set_n_frames(self, data_dict):
        """Distribute the per-sample frame budget across all vision items.

        Items with explicit frame_indices/time_indices are fixed; the remaining
        ("dynamic") video items share whatever budget is left, then totals are
        clamped to max_n_frames and optionally rounded up to a divisibility
        constraint."""

        if isinstance(self.n_frames, int):
            n_frames = self.n_frames
        else:
            n_frames = random.choice(self.n_frames)

        assert n_frames <= self.max_n_frames

        # Pass 1: count fixed frames and mark dynamic items.
        curr_n_frames = 0
        has_dynamic = False
        for msg in data_dict['messages']:
            if isinstance(msg['content'], dict):
                msg['content'] = [msg['content']]

            for content in msg['content']:

                if content['type'] == 'image':
                    curr_n_frames += 1
                elif content['type'] == 'video':
                    if 'frame_indices' in content['video']:
                        curr_n_frames += len(content['video']['frame_indices'])
                        content['video']['n_frames'] = len(content['video']['frame_indices'])
                    elif 'time_indices' in content['video']:
                        curr_n_frames += len(content['video']['time_indices'])
                        content['video']['n_frames'] = len(content['video']['time_indices'])
                    elif 'min_n_frames' in content['video']:
                        content['video']['min_n_frames'] = int(content['video']['min_n_frames'])
                        curr_n_frames += content['video']['min_n_frames']
                        content['video']['n_frames'] = content['video']['min_n_frames']
                        has_dynamic = True
                    elif 'fps' in content['video']:
                        # fps-based sampling: reserve the maximum, trimmed below.
                        content['video']['n_frames'] = self.max_n_frames
                        curr_n_frames += self.max_n_frames
                        has_dynamic = True
                    else:
                        content['video']['n_frames'] = 0
                        has_dynamic = True

        # Pass 2: grow dynamic items one frame at a time until the target is met.
        while curr_n_frames < n_frames and has_dynamic:
            for msg in data_dict['messages']:
                for content in msg['content']:
                    if content['type'] == 'video':
                        if 'frame_indices' in content['video']:
                            pass
                        elif 'time_indices' in content['video']:
                            pass
                        else:
                            if curr_n_frames < n_frames:
                                content['video']['n_frames'] += 1
                                curr_n_frames += 1

        # Pass 3: shrink dynamic items if the total exceeds the hard cap.
        while curr_n_frames > self.max_n_frames and has_dynamic:
            for msg in data_dict['messages']:
                for content in msg['content']:
                    if content['type'] == 'video':
                        if 'frame_indices' in content['video']:
                            pass
                        elif 'time_indices' in content['video']:
                            pass
                        else:
                            if curr_n_frames > self.max_n_frames:
                                content['video']['n_frames'] -= 1
                                curr_n_frames -= 1


        # Pass 4: round dynamic items up so n_frames is divisible by the
        # configured factor (may exceed the target budget slightly).
        for msg in data_dict['messages']:
            for content in msg['content']:
                if content['type'] == 'video':
                    if 'frame_indices' in content['video']:
                        pass
                    elif 'time_indices' in content['video']:
                        pass
                    else:
                        n = self.video_sampling_strategy.get('force_frames_n_divisible', 1)
                        if n > 1 and content['video']['n_frames'] % n != 0:
                            content['video']['n_frames'] += n - content['video']['n_frames'] % n

    def load_image_item(self, image_item) -> Image.Image:
        """
        Load a single image, either from an image file or one video frame.

        image_item:
            {"image_file": {"lq": "/path/to/image"}}
            {"video_file": {"lq": "/path/to/video"}, "frame_indices": 0}
        """

        # check format
        if ("image_file" not in image_item) and ("video_file" not in image_item):
            raise KeyError(f"Key 'image_file' or 'video_file' not found in image_item")
        if 'image_file' in image_item:
            if not isinstance(image_item['image_file'], str):
                raise ValueError(f"{image_item['image_file']} is not a str!")
        if 'video_file' in image_item:
            if not isinstance(image_item['frame_indices'], int):
                raise ValueError(f"{image_item['frame_indices']} is not a int!")

        if 'image_file' in image_item:
            image = read_image(image_item['image_file'])
        else:
            frame_indices = [image_item['frame_indices']]
            image = sample_video(image_item['video_file'], frame_indices = frame_indices)[0]

        return image

    def load_video_item(self, video_item) -> List[Image.Image]:
        """
        Load a video item into frames, delegating sampling to sample_video.

        video_item:
            {"video_file": {"lq": "/path/to/video"}, "n_frames": 8}
            {"video_file": {"lq": "/path/to/video"}, "frame_indices": [0, 1, 2], "n_frames": 3}
            {"video_file": {"lq": "/path/to/video"}, "start_frame": 0, "end_frame": 100, "n_frames": 8}
            {"video_file": {"lq": "/path/to/video"}, "time_indices": [0, 1, 2], "n_frames": 3}
            {"video_file": {"lq": "/path/to/video"}, "start_time": 0, "end_time": 100, "n_frames": 8}
            {"image_file": {"lq": ["/path/to/image"]}, "frame_indices": [0, 1, 2], "n_frames": 3}
        """

        # check format
        if ("image_file" not in video_item) and ("video_file" not in video_item):
            raise KeyError(f"Key 'image_file' or 'video_file' not found in video_item")

        video_path = video_item.get('video_file', video_item.get('image_file'))
        n_frames = video_item.get('n_frames', None)
        frame_indices = video_item.get('frame_indices', None)
        start_frame = video_item.get('start_frame', None)
        end_frame = video_item.get('end_frame', None)
        time_indices = video_item.get('time_indices', None)
        start_time = video_item.get('start_time', None)
        end_time = video_item.get('end_time', None)
        mask_boxes = video_item.get('mask_boxes', None)
        fps = video_item.get('fps', None)

        frames, frame_indices = sample_video(
            video_path=video_path,
            frame_indices=frame_indices,
            start_frame=start_frame,
            end_frame=end_frame,
            n_frames=n_frames,
            time_indices=time_indices,
            start_time=start_time,
            end_time=end_time,
            sampling_fps=fps,
            mask_boxes=mask_boxes,
            is_training=self.is_training,
            video_sampling_strategy=self.video_sampling_strategy,
            return_frame_ids=True,
        )

        # Optionally duplicate every frame (video treated as paired multi-images).
        if self.video_sampling_strategy.get('use_multi_images_for_video', False):
            new_frames = []
            for f in frames:
                new_frames.extend([f, f])
            frames = new_frames

        # sample_video may return a dict of frame metadata instead of plain ids.
        if isinstance(frame_indices, dict):
            return {
                'frames': frames,
                'video_info': frame_indices
            }
        return {'frames': frames}

    def postprocess_text(self, content, data_dict, image_processing_config, first_image):
        # Adjust coordinates for square padding; drop tiny OCR polygons.
        if image_processing_config.get('has_coordinates') and image_processing_config.get('do_padding'):
            content['text'] = adjust_bbox(content['text'], frame=first_image)
        if data_dict.get('task') == 'image/OCR' and image_processing_config.get('has_coordinates'):
            content['text'] = filter_ocr_polygon(content['text'])
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils.py ADDED
@@ -0,0 +1,452 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List, Dict, Union
2
+ import os
3
+ import random
4
+ import tempfile
5
+ from PIL import Image, ImageSequence
6
+ import base64
7
+ import io
8
+ import re
9
+ import uuid
10
+ import json
11
+ import numpy as np
12
+ import pyarrow.fs as pf
13
+ import func_timeout
14
+ from func_timeout import func_set_timeout
15
+ import math
16
+
17
+ # fmt: on
18
+ import decord
19
+ # fmt: off
20
+
21
+
22
def denorm_box(points, height, width):
    """Convert normalized (x, y) points (0..1) into absolute pixel coordinates,
    rounding to the nearest integer."""
    return [(round(p[0] * width), round(p[1] * height)) for p in points]
27
+
28
def process_image_for_tiktok(frames: List[Image.Image], mask_boxes):
    """Black out masked regions (e.g. captions/watermarks) in TikTok frames and
    crop each frame just below its lowest masked region.

    Args:
        frames: decoded PIL frames, all assumed to share one size.
        mask_boxes: per-frame lists of boxes; each box is a polygon of
            normalized (x, y) points. Extra entries beyond len(frames) are ignored.
    Returns:
        List of processed PIL images (possibly shorter in height than input).
    """
    mask_boxes = mask_boxes[:len(frames)]
    frames = [np.array(f) for f in frames]
    # assert len(mask_boxes) == len(frames)
    height, width = frames[0].shape[:2]

    new_frames = []
    for boxes, frame in zip(mask_boxes, frames):
        left, top, right, bottom = 0, 0, width, height
        for box in boxes:
            pts = np.array(denorm_box(box, height, width), np.int32)
            # Crop boundary: 30px below the lowest point of any masked box.
            upper_bound = max([p[1] for p in pts]) + 30
            if bottom > upper_bound:
                bottom = upper_bound
            # Zero out the masked rectangle.
            # NOTE(review): assumes polygon points are ordered top-left,
            # top-right, bottom-right, ... — confirm annotation format.
            frame[pts[0][1]: pts[2][1], pts[0][0]: pts[1][0]] = 0

        new_frames.append(Image.fromarray(frame[top: bottom, left: right]))
    return new_frames
46
+
47
+ # 先将视频分成 n_frames 份。训练时,每份随机抽一帧;测试时,每份抽中间的那一帧。
48
+ def _sample_frame_indices_v2(
49
+ total_frames: int,
50
+ n_frames: int,
51
+ is_training=False,
52
+ video_sampling_strategy = {},
53
+ ):
54
+ total_frames_idxs = list(range(total_frames))
55
+ if total_frames <= n_frames:
56
+ return total_frames_idxs
57
+ k, m = divmod(total_frames, n_frames)
58
+ frame_splits = [total_frames_idxs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in list(range(n_frames))]
59
+ if is_training:
60
+ sample_ids = [random.choice(i) for i in frame_splits]
61
+ else:
62
+ sample_ids = [i[(len(i)+1)//2-1] for i in frame_splits]
63
+ return sample_ids
64
+
65
+ # 均匀抽帧,必采样首尾帧。
66
+ def _sample_frame_indices_v1(total_frames: int, n_frames: int, is_training=False, video_sampling_strategy = {}):
67
+ if n_frames == 1:
68
+ return [0] # sample first frame in default
69
+ if total_frames <= n_frames:
70
+ return list(range(total_frames))
71
+ sample_ids = [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]
72
+ return sample_ids
73
+
74
def conduct_disturb_frame(frame_indices):
    """Randomly perturb a frame-index sequence (for temporal-disturbance data).

    One of four perturbations is chosen uniformly:
      - 'exchange': split into 4 equal segments, swap two random segments
      - 'crop':     keep a random 3/4-length window, resample uniformly inside it
      - 'reverse':  reverse a randomly placed span of 1/2..1 of the sequence
      - 'discard':  randomly drop half of the frames (temporal order kept)

    Returns:
        (disturb_type, new_frame_indices)
    """
    disturb_type = random.choice(['exchange', 'crop', 'reverse', 'discard'])
    n_frames = len(frame_indices)
    frame_indices_new = []
    if disturb_type == 'exchange':
        # Split into 4 equal segments and swap two randomly chosen ones.
        seg_len = math.ceil(n_frames / 4)
        seg_idxs = list(range(0, n_frames, seg_len))
        target_idxs = random.sample(range(0, 4), 2)
        seg_idxs[target_idxs[0]], seg_idxs[target_idxs[1]] = seg_idxs[target_idxs[1]], seg_idxs[target_idxs[0]]
        for idx in seg_idxs:
            frame_indices_new += frame_indices[idx: idx+seg_len]
    elif disturb_type == 'crop':
        # Randomly crop out a 3/4-length window, then resample n_frames uniformly.
        crop_len = math.ceil(n_frames / 4)
        idx_s = random.choice(range(0, crop_len+1))
        idx_e = n_frames - 1 - (crop_len - idx_s)
        frame_indices_new = np.linspace(frame_indices[idx_s], frame_indices[idx_e], n_frames, dtype=int).tolist()
    elif disturb_type == 'reverse':
        # Reverse a randomly placed span covering 1/2..1 of the total length.
        reverse_len = math.ceil(random.uniform(0.5,1) * n_frames)
        idx_s = random.choice(range(0, n_frames-reverse_len+1))
        idx_e = idx_s + reverse_len - 1
        frame_indices_new = frame_indices[:idx_s] + list(reversed(frame_indices[idx_s: idx_e+1])) + frame_indices[idx_e+1:]
    elif disturb_type == 'discard':
        # Randomly discard half of the frames, keeping temporal order.
        frame_indices_new = random.sample(frame_indices, n_frames//2)
        frame_indices_new.sort()
    return disturb_type, frame_indices_new
103
+
104
@func_set_timeout(60)
def _download_file(path):
    """Make `path` available on local disk, downloading from HDFS if needed.

    Non-HDFS paths are returned unchanged. HDFS files are copied to a
    uniquely-named file under the system temp dir, using a strategy based on
    file size. The whole call is bounded to 60s by @func_set_timeout.

    Raises:
        FunctionTimedOut: (from the decorator) if the download exceeds 60s.
        FileNotFoundError: if the resulting local path does not exist.
    """
    if path.startswith("hdfs"):
        # Unique temp name to avoid collisions between concurrent workers.
        local_path = os.path.join(tempfile.gettempdir(), f'{uuid.uuid4()}_' + os.path.basename(path))

        fs = pf.HadoopFileSystem.from_uri(uri="hdfs://harunava")
        hdfs_file = fs.open_input_file(path)
        file_size = hdfs_file.size()
        if file_size > 1024 * 1024 * 1024: # 1G
            # Very large file: multi-threaded chunked fetch via the hadoop CLI.
            # NOTE(review): os.system return codes are ignored; failures only
            # surface through the FileNotFoundError check below.
            os.system(f"hadoop fs -get --ct 8 -c 512 '{path}' '{local_path}' > /dev/null 2>&1")
        elif file_size > 1024 * 1024 * 100: # 100M
            # Large file: plain hadoop CLI copy.
            os.system(f"hadoop fs -get '{path}' '{local_path}' > /dev/null 2>&1")
        else:
            # Small file: stream through pyarrow.
            local_fs = pf.LocalFileSystem()
            with local_fs.open_output_stream(local_path) as local_file:
                while True:
                    chunk = hdfs_file.read(1024 * 1024 * 100) # read in 100MB chunks; adjust as needed
                    if not chunk:
                        break
                    local_file.write(chunk)
    else:
        # Already a local path; nothing to download.
        local_path = path

    if not os.path.exists(local_path):
        raise FileNotFoundError(f'{local_path}')

    return local_path
131
+
132
def download_file(path):
    """Fetch `path` to local disk, converting download timeouts to ValueError.

    Thin wrapper around `_download_file` so callers can handle a slow/hung
    download like any other bad-sample error.
    """
    try:
        local_path = _download_file(path)
    except func_timeout.exceptions.FunctionTimedOut as timeout_err:
        # Re-raise the 60s timeout as a ValueError for uniform error handling.
        raise ValueError(timeout_err)
    return local_path
138
+
139
class VideoReader:
    """Decord-backed video reader; HDFS paths are downloaded to a temp file first."""

    def __init__(self, path: str) -> None:
        self.path = path
        # Local copy of the video (identical to `path` for non-HDFS inputs).
        self.local_path = self.preprocess()
        # fault_tol=1 lets decord tolerate corrupted frames instead of raising.
        self.vr = decord.VideoReader(self.local_path, num_threads=1, ctx=decord.cpu(0), fault_tol=1)
        self.vr.seek(0)
        self._length = len(self.vr)
        self._fps = self.vr.get_avg_fps()

    @property
    def length(self):
        # Total number of decodable frames.
        return self._length

    @property
    def fps(self):
        # Average frames per second as reported by decord.
        return self._fps

    def sample(self, frame_indices) -> List[Image.Image]:
        """Decode the given frame indices and return them as RGB PIL images."""
        frames = self.vr.get_batch(frame_indices).asnumpy()
        frames = [Image.fromarray(f).convert('RGB') for f in frames]
        return frames

    def preprocess(self):
        # Ensure the video is available on local disk.
        return download_file(self.path)

    def postprocess(self):
        # Remove the temp copy only if it was downloaded from HDFS.
        if self.path.startswith("hdfs"):
            os.remove(self.local_path)
167
+
168
class ImageSeqReader:
    """Treats a list of image paths as a pseudo-video (one image per frame)."""

    def __init__(self, path: List[str]) -> None:
        self.path = path
        self.local_path = self.preprocess()
        self._length = len(self.local_path)
        # An image sequence has no intrinsic frame rate.
        self._fps = None

    @property
    def length(self):
        return self._length

    @property
    def fps(self):
        return self._fps

    def sample(self, frame_indices):
        """Load and return the images at the requested frame positions."""
        return [read_image(self.local_path[idx]) for idx in frame_indices]

    def preprocess(self):
        # Images are loaded lazily in sample(); just copy the path list.
        return [p for p in self.path]

    def postprocess(self):
        # Nothing to clean up: no temp files are created here.
        pass
194
+
195
class GIFReader:
    """Reads animated GIFs frame-by-frame via PIL; HDFS paths are downloaded first."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.local_path = self.preprocess()
        self.gif = Image.open(self.local_path)
        self._length = self.gif.n_frames
        # GIF stores a per-frame duration in milliseconds; convert to seconds.
        frame_seconds = self.gif.info.get('duration', 0) / 1000
        if frame_seconds > 0:
            self._fps = 1 / frame_seconds
        else:
            self._fps = None

    @property
    def length(self):
        return self._length

    @property
    def fps(self):
        return self._fps

    def sample(self, frame_indices):
        """Return RGB frames whose playback position appears in `frame_indices`.

        Frames come back in playback order (not in the order given), and a
        duplicated index yields a single frame.
        """
        frames = []
        for pos, frame in enumerate(ImageSequence.Iterator(self.gif)):
            if pos in frame_indices:
                frames.append(frame.convert('RGB'))
        return frames

    def preprocess(self):
        # Ensure a local copy exists (no-op for non-HDFS paths).
        return download_file(self.path)

    def postprocess(self):
        # Delete the temp copy if it was pulled from HDFS.
        if self.path.startswith("hdfs"):
            os.remove(self.local_path)
230
+
231
def check_frame_indices(frame_indices, total_frames, video_path):
    """Clamp and validate sampled frame indices against the real frame count.

    An index equal to `total_frames` (a common rounding artifact) is clamped
    to the last valid frame; any other out-of-range index is dropped with an
    error log.

    Args:
        frame_indices: candidate frame indices (may be empty).
        total_frames: number of frames actually in the video.
        video_path: only used in the error message.

    Returns:
        A new list containing only indices in [0, total_frames).
    """
    # Work on a copy so the caller's list is never mutated.
    frame_indices = list(frame_indices)
    # Robustness fix: an empty list previously raised IndexError on [-1].
    if frame_indices and frame_indices[-1] == total_frames:
        frame_indices[-1] = total_frames - 1

    valid_frame_indices = [i for i in frame_indices if 0 <= i < total_frames]

    if len(valid_frame_indices) != len(frame_indices):
        print(f'[Error] frame out of index. video_path={video_path}, frame_indices={frame_indices}, total_frames={total_frames}', flush=True)

    return valid_frame_indices
241
+
242
+
243
def sample_video(
    video_path: Union[str, List[str]],
    frame_indices: List[int] = None,
    start_frame:int=None,
    end_frame:int=None,
    n_frames:int = None,
    time_indices: List[float] = None,
    start_time:int=None,
    end_time:int=None,
    sampling_fps:float=None,
    mask_boxes=None,
    is_training:bool=False,
    video_sampling_strategy={'video_sampler_version': 'v1'},
    return_frame_ids: bool=False,
) -> List[Image.Image]:
    """Load a video (file / GIF / image sequence) and sample frames from it.

    Frame selection, in order of precedence:
      1. `sampling_fps` builds indices at a fixed rate (discarded again if it
         would exceed `n_frames`);
      2. `time_indices` (seconds) are converted to frame indices;
      3. explicit `start_time`/`end_time` (seconds) or
         `start_frame`/`end_frame` bound a range from which `n_frames` are
         drawn with the configured sampler ('v1' or 'v2');
      4. otherwise explicit `frame_indices` are used as given.

    Optionally applies face-mask boxes, pads the frame count up to a
    multiple of `force_frames_n_divisible` with black frames, and supports a
    "frame disturb" mode that perturbs the temporal order for training.

    Returns:
        `frames`, or `(frames, frame_indices)` if `return_frame_ids`, or
        `(frames, disturb-info dict)` if frame disturbance is enabled.
    """

    do_frame_disturb = video_sampling_strategy.get('do_frame_disturb', False)

    # Pick a reader implementation based on the input type.
    if isinstance(video_path, str):
        if video_path.endswith('.gif'):
            reader = GIFReader(video_path)
        else:
            reader = VideoReader(video_path)
    else:
        # A list of image paths is treated as one frame per image.
        reader = ImageSeqReader(video_path)

    total_frames = reader.length
    fps = reader.fps

    if sampling_fps is not None:
        # Fixed-rate sampling. NOTE(review): assumes fps >= sampling_fps;
        # round(fps / sampling_fps) == 0 would make range() raise — confirm
        # with callers.
        frame_indices = list(range(0, total_frames, round(fps / sampling_fps)))
        if len(frame_indices) > n_frames:
            # Too many frames at this rate: fall back to range-based sampling below.
            frame_indices = None

    if time_indices is not None:
        # Convert timestamps (seconds) to frame indices.
        frame_indices = [round(float(i) * fps) for i in time_indices]

    if start_time is not None and end_time is not None:
        # Convert a time window (seconds) to a frame window.
        start_frame = round(start_time * fps)
        end_frame = round(end_time * fps)

    if frame_indices is None:
        # Sample n_frames uniformly from [start_frame, end_frame].
        start_frame = 0 if start_frame is None else round(start_frame)
        end_frame = total_frames - 1 if end_frame is None else round(end_frame)

        if end_frame == total_frames:
            end_frame -= 1

        if video_sampling_strategy['video_sampler_version'] == 'v1':
            # Uniform sampling that always includes the first and last frame.
            frame_indices = _sample_frame_indices_v1(end_frame - start_frame + 1, n_frames, is_training, video_sampling_strategy)
        elif video_sampling_strategy['video_sampler_version'] == 'v2':
            frame_indices = _sample_frame_indices_v2(end_frame - start_frame + 1, n_frames, is_training, video_sampling_strategy)
        else:
            raise ValueError(f"video_sampler_version={video_sampling_strategy['video_sampler_version']} must be 'v1' or 'v2'")
        # Shift window-relative indices back to absolute frame positions.
        frame_indices = [i + start_frame for i in frame_indices]

    # Clamp/drop indices that fall outside the actual frame count.
    frame_indices = check_frame_indices(frame_indices, total_frames, video_path)

    if do_frame_disturb:
        # Perturb temporal order (training-time augmentation); keep the
        # original indices for the returned metadata.
        frame_disturb_type, frame_indices_new = conduct_disturb_frame(frame_indices)
        frame_indices_raw = frame_indices[:]
        frame_indices = frame_indices_new

    frames = reader.sample(frame_indices)
    if mask_boxes is not None:
        # Apply per-frame mask boxes (e.g. to blank out watermarks/faces).
        frames = process_image_for_tiktok(frames, mask_boxes)

    # Pad with black frames so len(frames) is divisible by n, if requested.
    n = video_sampling_strategy.get('force_frames_n_divisible', 1)
    if n > 1 and len(frames) % n != 0:
        new_n = n - len(frames) % n
        frames.extend([Image.new(mode='RGB', size=frames[-1].size) for _ in range(new_n)])

    # Clean up any temp file the reader downloaded.
    reader.postprocess()

    if do_frame_disturb:
        return frames, {"frame_indices": frame_indices, "disturb_type": frame_disturb_type, "frame_indices_raw": frame_indices_raw}
    if return_frame_ids:
        return frames, frame_indices
    return frames
323
+
324
+
325
+
326
def load_image_from_base64String(img_path):
    """Load a PIL image from a file whose contents are base64-encoded image bytes.

    Args:
        img_path: path to a file containing base64 text (e.g. a ".dat" file).

    Returns:
        The decoded PIL.Image (not converted to RGB; callers convert as needed).
    """
    # BUGFIX: the original `open(img_path, "rb").read()` leaked the file
    # handle; use a context manager to close it deterministically.
    with open(img_path, "rb") as f:
        img_bytes = base64.b64decode(f.read())
    return Image.open(io.BytesIO(img_bytes))
331
+
332
def read_image(image_path):
    """Load one image, transparently pulling it from HDFS when necessary.

    ".dat" files are treated as base64-encoded payloads; everything else is
    opened with PIL and converted to RGB. Temp copies downloaded from HDFS
    are deleted after loading.
    """
    local_file = download_file(image_path)

    is_base64_payload = local_file.endswith('.dat')
    if is_base64_payload:
        image = load_image_from_base64String(local_file)
    else:
        image = Image.open(local_file).convert('RGB')

    # Clean up the temp copy created for HDFS sources.
    if image_path.startswith("hdfs"):
        os.remove(local_file)
    return image
342
+
343
+
344
def adjust_bbox(text, frame):
    """Remap normalized coordinate lists embedded in `text` from the
    square-padded image space back to `frame`'s aspect ratio.

    Every "[x, y, x, y, ...]" group in `text` is rewritten: for landscape
    frames only y values are shifted/rescaled, for portrait frames only x
    values; square frames pass through unchanged (values become floats).
    """
    width, height = frame.size
    pieces = []
    cursor = 0
    for match in re.finditer(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', text):
        coords = [float(c) for c in re.findall(r"([0-9.]+)", match.group(0))]

        adjusted = []
        for i, value in enumerate(coords):
            is_y = (i % 2 != 0)
            if width > height and is_y:
                # Undo vertical padding: to pixels, shift by pad, renormalize.
                pixel = value * height + (width - height) // 2
                value = round(pixel / width, 2)
            elif height > width and not is_y:
                # Undo horizontal padding.
                pixel = value * width + (height - width) // 2
                value = round(pixel / height, 2)
            adjusted.append(value)

        pieces.append(text[cursor: match.start()])
        pieces.append(str(adjusted))
        cursor = match.end()
    pieces.append(text[cursor:])
    return ''.join(pieces)
380
+
381
def bbox_area(vertices, convert_format = True):
    """Area of an axis-aligned box given two opposite corners.

    With convert_format=True, `vertices` is a flat [x0, y0, x1, y1, ...]
    sequence that is first paired up into (x, y) tuples.
    """
    if convert_format:
        vertices = list(zip(vertices[::2], vertices[1::2]))
    (x0, y0), (x1, y1) = vertices[0], vertices[1]
    return abs((x1 - x0) * (y1 - y0))
387
+
388
def polygon_area(vertices, convert_format = True):
    """Area of a polygon via the shoelace formula.

    With convert_format=True, `vertices` is a flat [x0, y0, x1, y1, ...]
    sequence. A two-point input is treated as an axis-aligned box.
    """
    if convert_format:
        vertices = list(zip(vertices[::2], vertices[1::2]))
    n = len(vertices)  # number of polygon vertices
    if n == 2:
        # Degenerate "polygon": interpret the two points as a bounding box.
        return bbox_area(vertices, convert_format=False)
    # Shoelace formula: accumulate cross products of consecutive vertices.
    twice_area = 0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2
400
+
401
def get_text_len(text_line):
    """Visual length of a text line: CJK ideographs count 1, all other chars 0.5."""
    return sum(1 if '\u4e00' <= ch <= '\u9fff' else 0.5 for ch in text_line)
409
+
410
def filter_ocr_polygon(response, area_threshold=0.0005):
    """Drop OCR lines whose polygon is too small relative to their text length.

    Args:
        response: JSON string of [[coords, text_line], ...] with normalized
            polygon coordinates.
        area_threshold: minimum polygon-area-per-character to keep a line.

    Returns:
        The filtered JSON string (ensure_ascii=False), or `response`
        unchanged when it is not valid JSON.
    """
    try:
        resp = json.loads(response)
    # BUGFIX: narrowed from a bare `except:` (which also swallowed
    # KeyboardInterrupt/SystemExit). json.loads raises ValueError
    # (JSONDecodeError) for malformed JSON and TypeError for non-str input.
    except (ValueError, TypeError):
        return response
    new_resp = []
    for coords, text_line in resp:
        area = polygon_area(coords, convert_format=True)
        text_len = get_text_len(text_line)
        if text_len == 0:
            continue  # no visible text at all
        if area / text_len < area_threshold:
            continue  # polygon too small for the amount of text
        new_resp.append([coords, text_line])
    return json.dumps(new_resp, ensure_ascii=False)
427
+
428
def put_pred_to_data_dict(prediction, data_dict):
    """Write a model prediction into `data_dict` as the assistant's answer.

    If the last message is already an assistant turn, its final text content
    is overwritten; otherwise a new assistant message is appended. Mutates
    `data_dict` in place.
    """
    last_msg = data_dict['messages'][-1]
    if last_msg['role'] != 'assistant':
        data_dict['messages'].append({
            "role": "assistant",
            "content": [{"type": "text", "text": prediction}]
        })
    else:
        last_msg['content'][-1]['text'] = prediction
437
+
438
def get_prompt_from_data_dict(data_dict):
    """Render a chat-style data dict as a flat text prompt.

    Each content item becomes "[role]: ..." (text, or an <image>/<video>
    placeholder); a newline is appended after every message. Empty text
    contents are skipped.
    """
    parts = []
    for msg in data_dict['messages']:
        role = msg['role']
        assert role in {'system', 'user', 'assistant'}
        for content in msg['content']:
            kind = content['type']
            if kind == 'text':
                if content['text']:
                    parts.append(f"[{role}]: {content['text']}")
            elif kind == 'image':
                parts.append(f"[{role}]: <image>")
            elif kind == 'video':
                parts.append(f"[{role}]: <video>")
        parts.append('\n')
    return ''.join(parts)
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/utils_visualize.py ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ from typing import Dict, List, Optional
3
+ from PIL import Image, ImageDraw, ImageFont
4
+
5
+
6
def scale_polygon(polygon, w, h):
    """Scale normalized (x, y) vertices to pixel coordinates for a w-by-h image."""
    return [(x * w, y * h) for (x, y) in polygon]
11
+
12
def draw_polygon(image: Image.Image, points: List[List[int]], label: Optional[str] = None):
    """Draw a red polygon (or rectangle, when exactly two points are given)
    onto `image`, optionally labelling it in blue at the first vertex.

    Mutates `image` in place and returns it.

    Raises:
        ValueError: when fewer than two points are supplied.
    """
    drawer = ImageDraw.Draw(image)
    count = len(points)
    if count > 2:
        drawer.polygon(points, outline="red", width=3)
    elif count == 2:
        # Two points are interpreted as opposite corners of a rectangle.
        drawer.rectangle(points, outline="red", width=3)
    else:
        raise ValueError(f'points={points} only has one point!')

    if label is not None:
        font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 20)
        drawer.text(points[0], label, font=font, fill=(0, 0, 255))
    return image
25
+
26
def visualize_image_bbox(data_dict, image_processing_config, processor):
    """Debug helper: draw every coordinate list found in the sample's text
    contents onto the sample's first image content, in place.

    No-op unless `image_processing_config['has_coordinates']` is True.
    NOTE(review): assumes at least one image content exists when coordinates
    are present — otherwise `first_image_content` stays None and the
    subscript below raises; confirm with callers.
    """
    if image_processing_config.get('has_coordinates') != True:
        return

    messages = data_dict['messages']

    polygons = []
    first_image_content = None

    for msg in messages:
        for content in msg['content']:
            if content['type'] == 'text':
                # Match flat coordinate lists like "[0.1, 0.2, 0.3, 0.4]".
                for match in re.finditer(r'\[(\d+(\.\d+)?,\s*)+\d+(\.\d+)?\]', content["text"]):
                    coordinate_matches = re.findall(r"([0-9.]+)", match.group(0))
                    coords = [float(coord) for coord in coordinate_matches]
                    # Pair flat [x0, y0, x1, y1, ...] into (x, y) vertices.
                    polygons.append(list(zip(coords[::2], coords[1::2])))
            elif first_image_content is None and content['type'] == 'image':
                first_image_content = content

    first_image = first_image_content['image']
    # Preprocess first so drawn boxes line up with the final image size.
    first_image = processor.preprocess_image(first_image, image_processing_config)
    w, h = first_image.size

    if len(polygons) > 0:
        for i, polygon in enumerate(polygons):
            # Coordinates are normalized; scale to pixels before drawing.
            polygon = scale_polygon(polygon, w, h)
            first_image = draw_polygon(first_image, polygon, label=str(i))

    first_image_content['image'] = first_image
eval_scripts/DREAM-1K/tarsier/dataset/custom_data_parsers/video_permutation_parser.py ADDED
@@ -0,0 +1,137 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Dict, List
2
+ import random
3
+ from PIL import Image, ImageDraw, ImageFont
4
+
5
+ from .utils import sample_video
6
+
7
+
8
class VideoPermutationParser:
    """Builds "frame permutation" samples: a video's frames are shuffled, part
    of the true chronological order is revealed in the prompt, and the target
    is the order of the remaining frames."""

    def __init__(
        self,
        n_frames=8,
        is_training=True,
        frame_nums=None,
        video_sampling_strategy=None,
    ):
        """
        Args:
            n_frames: nominal frame count (kept for interface parity; the
                actual per-sample count is drawn from `frame_nums`).
            is_training: forwarded to the frame sampler.
            frame_nums: candidate per-sample frame counts; defaults to
                range(8, 25).
            video_sampling_strategy: options forwarded to `sample_video`.
        """
        self.n_frames = n_frames
        self.is_training = is_training
        # BUGFIX: `frame_nums=list(range(8, 25))` and
        # `video_sampling_strategy={}` were mutable default arguments shared
        # across all callers; use None sentinels instead.
        self.frame_nums = list(range(8, 25)) if frame_nums is None else frame_nums
        self.video_sampling_strategy = {} if video_sampling_strategy is None else video_sampling_strategy
        # fmt: off
        # Example of the expected input schema (documentation only).
        self.data_temp = {
            "text": [{
                "prompt": "<video>",
                "response": ""
            }],
            "video": [{
                "video_file": {
                    "yg": "/mnt/bn/videonasyg/videos/webvid_10M_download/011851_011900/1047443473.mp4",
                    "lq": "/mnt/bn/llmdatalq/jiangnan/video_generation/webvid_10M_download/20230609/videos/011851_011900/1047443473.mp4"
                },
                "frame_indices": [0, 85, 171, 256, 342, 427, 513, 598]
            }],
        }
        # fmt: on

    def check_format(self, data_dict: Dict):
        # Format validation is currently disabled.
        pass
        # for k in self.data_temp.keys():
        #     assert k in data_dict

    def transform(self, data_dict: Dict, image_processing_config: Dict = None) -> Dict:
        """Build the shuffled-frames chat sample for one annotation."""
        self.check_format(data_dict)

        frames = self.load_video_item(data_dict['video'][0])

        # frames = self.add_text_to_frames(frames) # for debug

        # 1-based original positions, shuffled to define the permutation.
        idxs = list(range(1, len(frames) + 1))
        random.shuffle(idxs)

        # Reveal the first 3/8 of the permutation in the prompt.
        prefix_len = int(3/8*len(idxs))

        shuffled_frames = [frames[i-1] for i in idxs]

        prompt = f'Output the correct chronological order of scrambled video frames. The order of the first {prefix_len} ones are:\n'
        prompt += '\n'.join([str(i) for i in idxs[: prefix_len]]) + '\nOutput the order of the following frames:'
        response = '\n'.join([str(i) for i in idxs[prefix_len: ]])

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": shuffled_frames},
                    {"type": "text", "text": prompt},
                ]
            },
            {
                "role": "assistant",
                "content": [
                    {"type": "text", "text": response}
                ]
            }
        ]

        return messages

    def load_video_item(self, video_item) -> List[Image.Image]:
        """Sample frames for one video item.

        video_item:
            {"video_file": "/path/to/video", "n_frames": 8}
            {"video_file": "/path/to/video", "frame_indices": [0, 1, 2], "n_frames": 3}
            {"video_file": "/path/to/video", "start_frame": 0, "end_frame": 100, "n_frames": 8}
            {"video_file": "/path/to/video", "time_indices": [0, 1, 2], "n_frames": 3}
            {"video_file": "/path/to/video", "start_time": 0, "end_time": 100, "n_frames": 8}
            {"image_file": ["/path/to/image"], "frame_indices": [0, 1, 2], "n_frames": 3}
        """

        # check format
        if ("image_file" not in video_item) and ("video_file" not in video_item):
            raise KeyError(f"Key 'image_file' or 'video_file' not found in video_item")

        video_path = video_item.get('video_file', video_item.get('image_file'))
        frame_indices = video_item.get('frame_indices', None)
        start_frame = video_item.get('start_frame', None)
        end_frame = video_item.get('end_frame', None)
        time_indices = video_item.get('time_indices', None)
        start_time = video_item.get('start_time', None)
        end_time = video_item.get('end_time', None)
        mask_boxes = video_item.get('mask_boxes', None)

        # NOTE: any 'n_frames' in video_item is intentionally overridden by a
        # random draw from self.frame_nums, padded up so the count is
        # divisible by force_frames_n_divisible when configured.
        n_frames = random.choice(self.frame_nums)
        n = self.video_sampling_strategy.get('force_frames_n_divisible', 1)
        if n > 1 and n_frames % n != 0:
            n_frames += n - n_frames % n

        frames, frame_indices = sample_video(
            video_path=video_path,
            frame_indices=frame_indices,
            start_frame=start_frame,
            end_frame=end_frame,
            n_frames=n_frames,
            time_indices=time_indices,
            start_time=start_time,
            end_time=end_time,
            mask_boxes=mask_boxes,
            is_training=self.is_training,
            video_sampling_strategy=self.video_sampling_strategy,
            return_frame_ids=True,
        )
        return frames

    def add_text_to_frames(self, frames: List[Image.Image]):
        """Debug helper: stamp each frame with its 1-based index in red."""
        new_frames = []
        for i, image in enumerate(frames):
            draw = ImageDraw.Draw(image)

            font = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 100)
            text_position = (50, 50)
            text_content = f'{i+1}'
            text_color = (255, 0, 0)
            draw.text(text_position, text_content, font=font, fill=text_color)
            new_frames.append(image)
        return new_frames
137
+
eval_scripts/DREAM-1K/tarsier/dataset/tarsier_datamodule.py ADDED
@@ -0,0 +1,280 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Datamodule for Llava Pretraining and Finetuning"""
2
+ import os
3
+ import re
4
+ from PIL import Image
5
+ import numpy as np
6
+ import re
7
+ import tempfile
8
+ from typing import Dict, List, Union, Tuple
9
+ import traceback
10
+ import json
11
+
12
+ import torch
13
+ import torch.nn.functional as F
14
+ from transformers import DataCollatorForSeq2Seq
15
+
16
+ from tools.rw_utils import read_jsonlines
17
+ from torch.utils.data import Dataset, DataLoader
18
+
19
+ np_str_obj_array_pattern = re.compile(r"[SaUO]")
20
+
21
+ default_collate_err_msg_format = (
22
+ "default_collate: batch must contain tensors, numpy arrays, numbers, "
23
+ "dicts or lists; found {}"
24
+ )
25
+
26
+ from .custom_data_parsers.standard_vision_parser import VisionParser
27
+ from .custom_data_parsers.object_tracking_parser import ObjectTrackingParser
28
+ from .custom_data_parsers.multi_images_parser import MultiImagesParser
29
+ from .custom_data_parsers.video_permutation_parser import VideoPermutationParser
30
+ from .custom_data_parsers.utils_visualize import visualize_image_bbox
31
+
32
+ from .tarsier_processor import TarsierProcessor
33
+
34
+ from tools.rw_utils import NumpyArrayEncoder
35
+ from .utils import DictToObject
36
+
37
class TarsierDataProcessor:
    """Turns chat-style annotation dicts into batched model inputs.

    Wraps a TarsierProcessor: each sample is routed to a task-specific parser
    (object tracking / multi-image / video permutation / standard vision),
    tokenized, and collated with longest-padding. Supports 'sft' and 'dpo'
    (chosen/rejected pair) training tasks.
    """

    def __init__(
        self,
        processor: TarsierProcessor,
        n_frames: Union[int, list],
        max_n_frames=256,
        max_pixels=int(1280 * 720 // 2),
        min_pixels=0,
        max_seq_len=None,
        is_training=True,  # affects frame sampling; responses are ignored at eval time
        print_data_error=True,
        do_image_padding=False,
        do_image_crop=False,
        do_image_resize=True,
        video_sampling_strategy=None,
        prompt='',
        train_task='sft',
        **kwargs
    ):
        """
        Args:
            processor: the wrapped TarsierProcessor (owns the tokenizer).
            n_frames: frames per video (int, or a list of candidate counts).
            max_seq_len: token budget; falls back to the tokenizer's
                model_max_length when None.
            video_sampling_strategy: forwarded to the vision parsers.
                (BUGFIX: was a mutable `{}` default argument.)
            prompt: when non-empty, overwrites every user text content.
            train_task: 'sft' or 'dpo'.
        """
        self.kwargs = kwargs

        self.processor = processor
        self.pad_collator = DataCollatorForSeq2Seq(processor.tokenizer, padding='longest')

        # BUGFIX: this previously read `self.tokenizer`, an attribute that
        # does not exist on this class (AttributeError whenever max_seq_len
        # was None); the tokenizer lives on the wrapped TarsierProcessor.
        self.processor.max_seq_len = processor.tokenizer.model_max_length if max_seq_len is None else max_seq_len

        self.n_frames = n_frames
        self.max_n_frames = max_n_frames
        self.max_pixels = max_pixels
        self.min_pixels = min_pixels

        self.is_training = is_training
        self.print_data_error = print_data_error
        self.do_image_padding = do_image_padding
        self.do_image_crop = do_image_crop
        self.do_image_resize = do_image_resize
        self.video_sampling_strategy = {} if video_sampling_strategy is None else video_sampling_strategy
        self.prompt = prompt
        self.train_task = train_task

        # One parser per task family; select_parser() dispatches per sample.
        self.object_tracking_parser = ObjectTrackingParser(
            n_frames=self.n_frames,
            max_objects=4,
            is_training=self.is_training,
        )
        self.multi_images_parser = MultiImagesParser(
            n_frames=self.n_frames,
            is_training=self.is_training,
        )
        self.video_permutation_parser = VideoPermutationParser(
            n_frames=self.n_frames,
            is_training=self.is_training,
            video_sampling_strategy=self.video_sampling_strategy,
        )
        self.vision_parser = VisionParser(
            n_frames=self.n_frames,
            max_n_frames=self.max_n_frames,
            is_training=self.is_training,
            video_sampling_strategy=self.video_sampling_strategy
        )

    def select_parser(self, data_dict):
        """Route a sample to its parser based on 'task' / 'dataset' tags."""
        if data_dict.get('task', None) == 'video/object_tracking':
            return self.object_tracking_parser
        elif data_dict.get('task', None) == 'multi_images':
            return self.multi_images_parser
        elif data_dict.get('dataset', None) == 'video_permutation':
            return self.video_permutation_parser
        else:
            return self.vision_parser

    def parse_image_processing_config(self, data_dict):
        """Merge per-sample image-processing overrides with instance defaults."""
        image_processing_config = data_dict.get('image_processing_config', {})

        do_padding = image_processing_config.get('do_padding', self.do_image_padding)
        do_crop = image_processing_config.get('do_crop', self.do_image_crop)
        do_resize = image_processing_config.get('do_resize', self.do_image_resize)
        max_pixels = image_processing_config.get('max_pixels', self.max_pixels)
        min_pixels = image_processing_config.get('min_pixels', self.min_pixels)

        assert min_pixels <= max_pixels

        image_processing_config['do_padding'] = do_padding
        image_processing_config['do_crop'] = do_crop
        image_processing_config['do_resize'] = do_resize
        image_processing_config['max_pixels'] = max_pixels
        image_processing_config['min_pixels'] = min_pixels

        return image_processing_config

    def _transform(self, raw_data_dict: Dict) -> Dict:
        """Process one sample dict into model inputs (wrapped in a list)."""
        # Deep-copy via JSON so parsers can mutate freely (numpy-safe encoder).
        data_dict = json.loads(json.dumps(raw_data_dict, cls=NumpyArrayEncoder))
        del raw_data_dict

        # Optional global prompt override for every user text content.
        if self.prompt:
            for msg in data_dict['messages']:
                if msg['role'] == 'user':
                    for content in msg['content']:
                        if content['type'] == 'text':
                            content['text'] = self.prompt

        # Keep an unmodified copy to return alongside the tensors.
        data_dict_copy = json.loads(json.dumps(data_dict, cls=NumpyArrayEncoder))

        image_processing_config = self.parse_image_processing_config(data_dict)
        parser = self.select_parser(data_dict)
        messages = parser.transform(data_dict, image_processing_config)
        data_dict_copy['extra_info'] = data_dict.pop('extra_info', {})

        outputs = self.processor(messages, image_processing_config, is_training=self.is_training)

        outputs['raw_data_dict'] = data_dict_copy

        return [outputs]

    def _split_chosen_rejected(self, data_dict: Dict):
        """For DPO: fan one sample out into a (chosen, rejected) pair.

        NOTE: the chosen dict aliases (and mutates) the input `data_dict`;
        the rejected dict is a deep copy.
        """
        chosen_data_dict = data_dict
        rejected_data_dict = json.loads(json.dumps(data_dict, cls=NumpyArrayEncoder))
        for msg in chosen_data_dict['messages']:
            if msg['role'] == 'assistant':
                for content in msg['content']:
                    if content['type'] == 'text':
                        content['text'] = content['chosen']

        for msg in rejected_data_dict['messages']:
            if msg['role'] == 'assistant':
                for content in msg['content']:
                    if content['type'] == 'text':
                        content['text'] = content['rejected']

        return chosen_data_dict, rejected_data_dict

    def transform(self, data_dict: Dict) -> Dict:
        """Process one sample; returns [] on failure (optionally logging it)."""
        try:
            if self.train_task == 'dpo':
                chosen_data_dict, rejected_data_dict = self._split_chosen_rejected(data_dict)
                return self._transform(chosen_data_dict) + self._transform(rejected_data_dict)
            return self._transform(data_dict)
        except Exception as e:
            if self.print_data_error:
                print(traceback.format_exc())
                print(f'Error occurs when processing: \n{data_dict}')
            return []

    def batch_transform(self, batch_data: List[Dict]) -> Dict:
        """Collate processed samples into one padded batch."""
        model_inputs = {}
        raw_data_dict = [d.pop('raw_data_dict') for d in batch_data]
        model_inputs['raw_data_dict'] = raw_data_dict

        batch_pixel_values = [d.pop('pixel_values') for d in batch_data if 'pixel_values' in d]
        batch_image_grid_thw = [d.pop('image_grid_thw') for d in batch_data if 'image_grid_thw' in d]
        if len(batch_pixel_values) == 0:
            # Text-only batch: inject a dummy image so the vision tower still
            # receives well-formed tensors.
            vision_placeholder = self.get_vision_placeholder()
            batch_pixel_values = [vision_placeholder.get('pixel_values')]
            batch_image_grid_thw = [vision_placeholder.get('image_grid_thw')] if 'image_grid_thw' in vision_placeholder else []

        model_inputs['pixel_values'] = torch.cat(batch_pixel_values, dim=0)
        if len(batch_image_grid_thw) > 0:
            model_inputs['image_grid_thw'] = torch.cat(batch_image_grid_thw, dim=0)

        batch_num_images = [d.pop('num_images') for d in batch_data]
        model_inputs['num_images'] = torch.tensor(batch_num_images)
        # Pad the remaining token-level fields to the longest sequence.
        model_inputs.update(self.pad_collator(batch_data))
        return model_inputs

    def __call__(self, batch_data: Union[Dict, List[Dict]]) -> Dict:
        if isinstance(batch_data, dict):
            batch_data = [batch_data]
        # NOTE(review): transform() returns [] on error, which would make
        # [0] raise IndexError here — confirm upstream filtering.
        batch = [self.transform(d)[0] for d in batch_data]
        return self.batch_transform(batch)

    def get_vision_placeholder(self):
        """Dummy single-image sample used to pad text-only batches."""
        messages = [{"role": "user", "content": [{"type": "image", "image": Image.new(mode='RGB', size=(336, 336))}]}]
        image_processing_config = self.parse_image_processing_config({})
        return self.processor(messages, image_processing_config)

    def get_text_placeholder(self):
        """Dummy text-only sample (e.g. for probing sequence shapes)."""
        messages = [
            {"role": "user", "content": [{"type": "text", "text": "Hello!"}]},
            {"role": "assistant", "content": [{"type": "text", "text": "Thank you very much"}]},
        ]
        image_processing_config = self.parse_image_processing_config({})
        return self.processor(messages, image_processing_config)
223
+
224
def init_processor(processor: Union[TarsierProcessor, str]=None, config: Dict=None):
    """Build a TarsierDataProcessor from either an existing TarsierProcessor
    or a pretrained-checkpoint path, configured from `config` (dict or object)."""
    if isinstance(config, dict):
        config = DictToObject(config)
    if isinstance(processor, str):
        # A string is interpreted as a checkpoint path/name to load from.
        base_processor = TarsierProcessor.from_pretrained(
            processor,
            padding_side='left',
            trust_remote_code=True
        )
    else:
        base_processor = processor
    return TarsierDataProcessor(
        processor=base_processor,
        n_frames=config.n_frames,
        max_n_frames=config.max_n_frames,
        max_pixels=config.max_pixels,
        min_pixels=config.min_pixels,
        max_seq_len=config.max_seq_len,
        is_training=config.is_training,
        print_data_error=config.print_data_error,
        do_image_padding=config.do_image_padding,
        do_image_crop=config.do_image_crop,
        do_image_resize=config.do_image_resize,
        video_sampling_strategy=config.video_sampling_strategy,
        prompt=config.prompt,
        train_task=config.train_task
    )
251
+
252
class TarsierDataset(Dataset):
    """Dataset over jsonlines annotations, processed by a TarsierDataProcessor.

    `__getitem__` returns (annotation, model_inputs); model_inputs is None
    when processing fails, so the collate side can skip bad samples.
    """

    def __init__(self, ann_path="", anns=None, config: Dict=None, processor: Union[TarsierDataProcessor, TarsierProcessor, str]=None):
        """
        Args:
            ann_path: one jsonlines path or a list of paths (ignored when
                `anns` is provided).
            anns: pre-loaded list of annotation dicts.
            config: processor settings (dict or attribute object).
            processor: a ready TarsierDataProcessor, or a TarsierProcessor /
                checkpoint path from which one is built via init_processor.
        """
        self.config = DictToObject(config) if isinstance(config, dict) else config
        if not isinstance(processor, TarsierDataProcessor):
            self.processor = init_processor(processor, config)
        else:
            self.processor = processor
        if anns is None:
            self.anns = []
            if isinstance(ann_path, str):
                ann_path = [ann_path]
            for path in ann_path:
                self.anns.extend(read_jsonlines(path))
        else:
            self.anns = anns

    def __len__(self):
        return len(self.anns)

    def __getitem__(self, index):
        if index < 0 or index >= len(self.anns):
            raise IndexError("Index out of range")
        # BUGFIX: hoisted out of the try block. If the annotation lookup
        # itself failed, `ann` was unbound and the `return ann, None` in the
        # except branch raised a NameError that masked the real error.
        ann = self.anns[index]
        try:
            model_inputs = self.processor(ann)
        except Exception as e:
            # Best-effort: report and let the caller drop the sample.
            print(f"Load data error: {e}")
            return ann, None
        return ann, model_inputs
eval_scripts/DREAM-1K/tarsier/dataset/tarsier_processor.py ADDED
@@ -0,0 +1,240 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import List, Union
2
+ from PIL import Image
3
+
4
+ import torch
5
+
6
+ from transformers.feature_extraction_utils import BatchFeature
7
+ from transformers.image_utils import ImageInput, get_image_size, to_numpy_array
8
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, _validate_images_text_input_order
9
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
10
+ from transformers.utils import logging
11
+ from transformers import Qwen2VLImageProcessor
12
+ from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
13
+
14
+ logger = logging.get_logger(__name__)
15
+
16
+
17
class TarsierProcessorKwargs(ProcessingKwargs, total=False):
    """Typed kwargs container for TarsierProcessor, split into text/images groups."""
    # Both groups default to empty so that tokenizer / image-processor defaults
    # (merged in ProcessorMixin._merge_kwargs) apply unless explicitly overridden.
    _defaults = {
        "text_kwargs": {},
        "images_kwargs": {},
    }
22
+
23
+
24
class TarsierProcessor(ProcessorMixin):
    """Joint text + vision processor for Tarsier.

    Wraps an image processor and a tokenizer (loaded via the Auto* classes)
    and turns a chat-style `messages` list — whose contents may carry inline
    PIL images or frame lists — into model inputs: `input_ids`, `labels`,
    `pixel_values` and (for Qwen2-VL style processors) grid info.
    """

    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = ["chat_template", "image_token", "patch_size", "merge_size", "temporal_patch_size", "max_seq_len"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
        self,
        image_processor=None,
        tokenizer=None,
        chat_template=None,
        image_token="<image>",      # placeholder token later expanded to vision tokens
        patch_size=None,            # vision patch size (used for non-Qwen2-VL token counting)
        merge_size=1,               # spatial merge factor for vision tokens
        temporal_patch_size=1,      # frames folded into one temporal patch per image
        max_seq_len=8192,           # hard cap on the token sequence length
        **kwargs,
    ) -> None:

        self.image_token = image_token
        self.patch_size = patch_size
        self.merge_size = merge_size
        self.temporal_patch_size = temporal_patch_size
        self.max_seq_len = max_seq_len
        # Total pixel budget per sample, shared across all frames/images.
        self.max_pixels_per_sample = 128 * 384 * 384

        super().__init__(image_processor, tokenizer, chat_template=chat_template)

    def __call__(
        self,
        messages,
        image_processing_config=None,
        is_training=True,
    ) -> torch.Tensor:
        """Convert chat `messages` into model inputs.

        Returns a dict with input_ids, labels (non-assistant tokens masked to
        -100), num_images, and — when vision content is present — pixel_values
        plus image_grid_thw for Qwen2-VL style image processors.
        """

        output_kwargs = self._merge_kwargs(
            TarsierProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
        )

        # --- Vision preprocessing ---
        pixel_values, image_grid_thw = [], []
        # First pass: count frames so the per-frame pixel budget can be derived.
        num_images = 0
        for msg in messages:
            for content in msg['content']:
                if content['type'] == 'image':
                    num_images += self.temporal_patch_size
                elif content['type'] == 'video':
                    num_images += len(content['video'])
        # Shrink max_pixels so that all frames together stay within the sample budget.
        # NOTE(review): assumes image_processing_config is a dict whenever vision
        # content is present — confirm callers never pass None alongside images.
        if num_images > 0 and self.max_pixels_per_sample // num_images < image_processing_config['max_pixels']:
            image_processing_config['max_pixels'] = self.max_pixels_per_sample // num_images
            image_processing_config['min_pixels'] = min(image_processing_config['min_pixels'], image_processing_config['max_pixels'])

        # Second pass: run crop/pad/resize, then the HF image processor, in place.
        for msg in messages:
            for content in msg['content']:
                if content['type'] == 'image':
                    content['image'] = self.preprocess_image(content['image'], image_processing_config)
                    content['image'] = self.image_processor(images = content['image'], **output_kwargs["images_kwargs"], return_tensors="pt")
                    content['num_vision_tokens'] = self.get_num_vision_tokens(content)
                    pixel_values.append(content['image']['pixel_values'])
                    if 'image_grid_thw' in content['image']:
                        image_grid_thw.extend(content['image']['image_grid_thw'])
                elif content['type'] == 'video':
                    content['video'] = self.preprocess_image(content['video'], image_processing_config)
                    # Qwen2-VL has a dedicated video path; other processors treat
                    # the frame list as a batch of images.
                    if isinstance(self.image_processor, Qwen2VLImageProcessor):
                        content['video'] = self.image_processor(images = None, videos = content['video'], **output_kwargs["images_kwargs"], return_tensors="pt")
                        pixel_values.append(content['video']['pixel_values_videos'])
                    else:
                        content['video'] = self.image_processor(images = content['video'], **output_kwargs["images_kwargs"], return_tensors="pt")
                        pixel_values.append(content['video']['pixel_values'])

                    if 'video_grid_thw' in content['video']:
                        image_grid_thw.extend(content['video']['video_grid_thw'])
                    content['num_vision_tokens'] = self.get_num_vision_tokens(content)

        # --- Text processing ---
        # At inference time, append the generation prompt unless the sample
        # already ends with an assistant turn (then strip its final EOS instead).
        add_generation_prompt = (not is_training and messages[-1]['role'] != 'assistant')
        strip_final_eos = (not is_training and messages[-1]['role'] == 'assistant')
        text_inputs = self.tokenizer.apply_chat_template(
            messages,
            chat_template = self.chat_template,
            tokenize=True,
            tokenizer_kwargs = output_kwargs["text_kwargs"],
            return_assistant_tokens_mask=True,
            return_dict=True,
            add_generation_prompt=add_generation_prompt,
            strip_final_eos=strip_final_eos,
        )
        # Supervise only assistant tokens; everything else becomes -100.
        labels = [-100 if j == 0 else i for i, j in zip(text_inputs['input_ids'], text_inputs['assistant_masks'])]
        labels = labels[:self.max_seq_len]
        input_ids = text_inputs['input_ids'][:self.max_seq_len]

        # Refuse to truncate through vision tokens: an image token beyond the
        # cap would desynchronize the text stream from pixel_values.
        image_token_id = self.tokenizer.convert_tokens_to_ids(self.image_token)
        if image_token_id in text_inputs['input_ids'][self.max_seq_len:]:
            raise ValueError(f'Too long sequence! {len(text_inputs["input_ids"])}')

        outputs = {
            'input_ids': input_ids,
            'labels': labels,
            'num_images': num_images,
        }
        if len(pixel_values) > 0:
            outputs['pixel_values'] = torch.cat(pixel_values, dim=0)
        if len(image_grid_thw) > 0:
            outputs['image_grid_thw'] = torch.stack(image_grid_thw)
        return outputs


    def preprocess_image(self, pil_img: Union[Image.Image, List[Image.Image]], image_processing_config):
        """Apply the optional crop/pad/resize steps to one image or a list of frames."""
        if image_processing_config is None:
            return pil_img
        images = pil_img
        if isinstance(pil_img, Image.Image):
            images = [images]
        if image_processing_config['do_crop']:
            images = [self.centralcrop(img, rate=[4, 3]) for img in images]
        if image_processing_config['do_padding']:
            images = [self.expand2square(
                img,
                # Black padding; alternatively the processor's image_mean could be used:
                # tuple(int(x * 255) for x in self.processor.image_processor.image_mean)
                tuple(int(x * 255) for x in [0, 0, 0])
            ) for img in images]
        if image_processing_config['do_resize']:
            images = [self.resize2square(img) for img in images]
        if image_processing_config.get('max_pixels'):
            images = [self.resize2pixels(
                img,
                int(image_processing_config['max_pixels']),
                int(image_processing_config['min_pixels'])
            ) for img in images]
        # Preserve the input shape: single image in, single image out.
        if isinstance(pil_img, Image.Image):
            images = images[0]
        return images

    def expand2square(self, pil_img, background_color):
        """Pad the shorter side with background_color so the image becomes square."""
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result

    def resize2square(self, pil_img: Image.Image):
        """Stretch (not pad) the image to a square of its longer side."""
        width, height = pil_img.size
        pil_img = pil_img.resize((max(width, height), max(width, height)))
        return pil_img

    def centralcrop(self, pil_img: Image.Image, rate=[4, 3]):
        """Center-crop along the longer side so the aspect ratio is at most rate (default 4:3)."""
        width, height = pil_img.size
        size = (width, height)
        min_len = min(size)
        longer_side = 0 if width >= height else 1
        center = (width/2, height/2)
        box = [0, 0, size[0], size[1]]

        # Clamp the crop box on the longer axis around the image center.
        box[longer_side] = max(0, center[longer_side] - 1/2*min_len/rate[1]*rate[0])
        box[2 + longer_side] = min(size[longer_side], center[longer_side] + 1/2*min_len/rate[1]*rate[0])

        pil_img = pil_img.crop(box)
        return pil_img

    def resize2pixels(self, pil_img: Image.Image, max_pixels=None, min_pixels=None):
        """Resize so the pixel count falls within [min_pixels, max_pixels] via Qwen2-VL smart_resize."""
        width, height = pil_img.size
        new_height, new_width = smart_resize(height, width, factor=1, max_pixels=max_pixels, min_pixels=min_pixels)
        pil_img = pil_img.resize((new_width, new_height))
        return pil_img

    def get_num_vision_tokens(self, content):
        """Number of LLM tokens the processed image/video will occupy."""
        if isinstance(self.image_processor, Qwen2VLImageProcessor):
            # Qwen2-VL: grid cells divided by the spatial merge factor.
            merge_length = self.image_processor.merge_size**2
            if content['type'] == 'image':
                num_image_tokens = content['image']['image_grid_thw'].prod() // merge_length
            else:
                num_image_tokens = content['video']['video_grid_thw'].prod() // merge_length
            return num_image_tokens
        else:
            # Other models: image tokens (2x2-compressed patch grid) plus one
            # image_newline token per row and one image_new token per frame.
            k = 'image' if content['type'] == 'image' else 'video'
            pixel_values = content[k]['pixel_values'][0]
            n_frames = len(content[k]['pixel_values'])

            height, width = get_image_size(to_numpy_array(pixel_values))
            num_image_tokens = (height // (self.patch_size * self.merge_size)) * (width // (self.patch_size * self.merge_size) + 1) + 1
            return num_image_tokens * n_frames

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to Qwen2TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        # Order-preserving union of tokenizer and image-processor input names.
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
eval_scripts/DREAM-1K/tarsier/dataset/utils.py ADDED
@@ -0,0 +1,186 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from typing import List
15
+ import os
16
+ from PIL import Image, ImageSequence
17
+ import decord
18
+
19
+ VALID_DATA_FORMAT_STRING = "Input data must be {'.jpg', '.jpeg', '.png', '.tif'} for image; or {'.mp4', '.avi', '.webm', '.mov', '.mkv', '.wmv', '.gif'} for videos!"
20
+
21
# Uniformly sample frame indices; the first and last frames are always included.
def sample_frame_indices(start_frame, total_frames: int, n_frames: int):
    """Return n_frames indices evenly spread over [start_frame, start_frame + total_frames - 1].

    Args:
        start_frame: absolute index of the first frame of the clip.
        total_frames: number of frames in the clip.
        n_frames: how many indices to sample.
    """
    if n_frames == 1:
        # Bug fix: honor start_frame instead of always returning absolute frame 0,
        # consistent with the offsetting applied in the general branch below.
        return [start_frame]  # sample first frame of the clip by default
    sample_ids = [round(i * (total_frames - 1) / (n_frames - 1)) for i in range(n_frames)]
    sample_ids = [i + start_frame for i in sample_ids]
    return sample_ids
28
+
29
def sample_video(
    video_path: str,
    n_frames: int = None,
    start_time: int = 0,
    end_time: int = -1
) -> List[Image.Image]:
    """Decode a video and uniformly sample `n_frames` RGB frames.

    `start_time` / `end_time` (seconds) clip the sampled range; a
    non-positive `end_time` means "until the last frame".
    """
    assert os.path.exists(video_path), f"File not found: {video_path}"

    reader = decord.VideoReader(video_path, num_threads=1, ctx=decord.cpu(0))
    reader.seek(0)
    total = len(reader)
    fps = reader.get_avg_fps()

    last = total - 1
    first = 0 if start_time <= 0 else min(last, int(fps * start_time))
    end = last
    if end_time > 0:
        end = min(max(first, int(fps * end_time)), last)

    indices = sample_frame_indices(
        start_frame=first,
        total_frames=end - first + 1,
        n_frames=n_frames,
    )

    batch = reader.get_batch(indices).asnumpy()
    return [Image.fromarray(arr).convert('RGB') for arr in batch]
58
+
59
def sample_gif(
    gif_path: str,
    n_frames: int = None,
    start_time: int = 0,
    end_time: int = -1
) -> List[Image.Image]:
    """Uniformly sample `n_frames` RGB frames from a GIF (start/end times are unused)."""
    assert os.path.exists(gif_path), f"File not found: {gif_path}"

    gif = Image.open(gif_path)
    wanted = set(sample_frame_indices(
        start_frame=0,
        total_frames=gif.n_frames,
        n_frames=n_frames,
    ))

    return [
        frame.convert('RGB')
        for idx, frame in enumerate(ImageSequence.Iterator(gif))
        if idx in wanted
    ]
85
+
86
def sample_image(
    image_path: str,
    n_frames: int = None,
    start_time: int = 0,
    end_time: int = -1
):
    """Load a single image as a one-element RGB frame list (n_frames/times are ignored)."""
    assert os.path.exists(image_path), f"File not found: {image_path}"
    return [Image.open(image_path).convert('RGB')]
95
+
96
def get_visual_type(input_file):
    """Classify a media file by extension: 'image' | 'video' | 'gif' | 'unk'."""
    ext = os.path.splitext(input_file)[-1]
    video_exts = {'.mp4', '.avi', '.webm', '.mov', '.mkv', '.wmv'}
    image_exts = {'.jpg', '.jpeg', '.png', '.tif'}
    if ext == '.gif':
        return 'gif'
    if ext in video_exts:
        return 'video'
    if ext in image_exts:
        return 'image'
    print(f"{VALID_DATA_FORMAT_STRING} But found {ext}!")
    return 'unk'
107
+
108
def get_benchmarks(benchmarks):
    """Expand benchmark aliases and task-type groups into a deduplicated name list."""
    type2bm = {
        'dream': ['dream'],
        'caption': ['msvd-caption', 'msr-vtt-caption', 'vatex-caption'],
        'mc_qa': ['next-qa', 'egoschema', 'mvbench', 'video-mme'],
        'oe_qa': ['msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa'],
    }
    selected = []
    for name in benchmarks:
        name = name.lower()
        if name in selected:
            continue
        if name == 'all':
            # 'all' short-circuits: append every grouped benchmark and stop.
            for group in type2bm.values():
                selected.extend(group)
            return selected
        # Task types expand to their group; unknown names pass through as-is.
        selected.extend(type2bm.get(name, [name]))
    return selected
129
+
130
def check_data_format(data):
    """Validate message structure and media paths; normalizes dict content to a list in place."""
    for message in data['messages']:
        content_list = message['content']
        if isinstance(content_list, dict):
            # Single content dict is normalized to a one-element list in place.
            content_list = [content_list]
            message['content'] = content_list
        for piece in content_list:
            ctype = piece['type']
            assert ctype in {'image', 'video', 'text'}, f"content['type']={ctype} MUST be one of ['image', 'video', 'text']"
            if ctype == "text":
                continue
            paths = piece[ctype][f"{ctype}_file"]
            if isinstance(paths, str):
                paths = [paths]
            for path in paths:
                assert os.path.exists(path), f"File not found: {path}"
143
+
144
def format_one_sample(media_file=None, prompt="Describe the video in detail."):
    """Build a single-sample messages dict for one optional media file plus a text prompt."""
    user_content = {
        "role": "user",
        "content": []
    }
    media_type = None
    if media_file is not None:
        media_type = get_visual_type(media_file)
        # GIFs are treated as videos downstream.
        if media_type in ("video", "gif"):
            media_type = "video"
        user_content["content"].append({
            "type": media_type,
            media_type: {
                f"{media_type}_file": media_file,
            }
        })
    user_content["content"].append({
        "type": "text",
        "text": prompt
    })

    sample = {
        "messages": [
            user_content,
            {"role": "assistant", "content": []},
        ]
    }
    sample["task"] = 'text-only' if media_file is None else f"{media_type}/QA"
    check_data_format(sample)
    return sample
181
+
182
+
183
class DictToObject(object):
    """Lightweight attribute-access wrapper: copies each dict entry onto the instance."""

    def __init__(self, dictionary):
        for key in dictionary:
            setattr(self, key, dictionary[key])
eval_scripts/DREAM-1K/tarsier/evaluation/evaluate.py ADDED
@@ -0,0 +1,177 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import os
15
+ import random
16
+
17
+ from .metrics import CIDErMetric, GPTMetric, DREAMGPTMetric, AccuracyMetric, VideoMMEAccuracyMetric
18
+ import sys
19
+ sys.path.append('eval_scripts/DREAM-1K/tarsier')
20
+ from tools.rw_utils import read_jsonlines
21
+ from tools.color import Color
22
+ from dataset.utils import get_benchmarks
23
+
24
def extract_item_for_eval(data_dict):
    """Flatten a messages-style record into {prompt, prediction, response, ...} for metrics."""
    prompt_parts, pred_parts, ref_parts = [], [], []
    for msg in data_dict['messages']:
        role = msg['role']
        for content in msg['content']:
            if content['type'] != 'text':
                continue
            if role == 'user':
                prompt_parts.append(content['text'])
            elif role == 'assistant':
                # Ground-truth reference rides alongside the model prediction.
                if content.get('reference'):
                    ref_parts.append(content['reference'])
                pred_parts.append(content['text'])

    item = {
        'prompt': ''.join(prompt_parts),
        'prediction': ''.join(pred_parts),
        'response': ''.join(ref_parts),
        'dataset': data_dict['dataset'],
        'idx': f"{data_dict['dataset']}_{data_dict['idx']}",
    }

    extra_info = data_dict.get('extra_info', None)
    vid = data_dict.get('vid', None)
    if vid is not None:
        item['vid'] = vid
    if extra_info:
        item['events'] = extra_info.get('events', None)
        item['extra_info'] = extra_info
    if 'is_hard' in data_dict:
        item['is_hard'] = data_dict['is_hard']

    return item
55
+
56
def read_dataset(path, dataset_name):
    """Load prediction records for one benchmark, dedup by dataset+idx, convert for eval."""
    if os.path.isdir(path):
        # A directory means sharded predictions: merge every .jsonl inside.
        lines = []
        for fname in os.listdir(path):
            if fname.endswith('.jsonl'):
                lines.extend(read_jsonlines(os.path.join(path, fname)))
    else:
        lines = read_jsonlines(path)

    dataset, seen = [], set()
    for line in lines:
        # Keep only records belonging to the requested benchmark.
        if line['dataset'].split('/')[0] != dataset_name:
            continue
        key = f"{line['dataset']}_{line['idx']}"
        if key in seen:
            continue
        seen.add(key)
        dataset.append(extract_item_for_eval(line))
    return dataset
77
+
78
# Registry resolving metric names (strings, as used in Benchmark2Metric below
# and in serialized configs) to the metric classes imported from .metrics.
METRIC_MAPPING = {
    'CIDErMetric': CIDErMetric,
    'GPTMetric': GPTMetric,
    'AccuracyMetric': AccuracyMetric,
    'DREAMGPTMetric': DREAMGPTMetric,
    'VideoMMEAccuracyMetric': VideoMMEAccuracyMetric
}
85
+
86
def evaluate(pred_file, METRIC, dataset_name, sample_num=-1, verbose = False):
    """Run one metric over the predictions of dataset_name; optionally subsample."""
    dataset = read_dataset(pred_file, dataset_name)
    if not dataset:
        # No record for this benchmark in the prediction file.
        return
    if sample_num > 0:
        dataset = random.sample(dataset, sample_num)

    metric = METRIC(dataset_name=dataset_name, verbose=verbose)
    metric.process(dataset)
    metric.summarize_metric()
    metric.save_results(pred_file)
    # DREAM additionally persists the per-event GPT judgments.
    if isinstance(metric, DREAMGPTMetric):
        metric.save_eval_infos(pred_file)
98
+
99
def evaluate_all(pred_file, METRIC2DATASET, sample_num=-1, verbose = False):
    """Evaluate pred_file against every (metric, dataset) pair; metrics may be name strings."""
    for metric_cls, dataset_name in METRIC2DATASET:
        if isinstance(metric_cls, str):
            # Resolve a metric name to its class.
            metric_cls = METRIC_MAPPING[metric_cls]
        print(f"### Start Evaluating on {dataset_name}")
        evaluate(pred_file, metric_cls, dataset_name, sample_num, verbose)
        print(f"### Finish Evaluating on {dataset_name}")
107
+
108
if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--pred_file', type=str)
    parser.add_argument('--benchmarks', nargs='+', default=["all"], help="Default as 'all' to evaluate on all benchmarks; Also could be task types: ('dream', 'caption', 'mc_qa', 'oe_qa'); And specific benchmark names: ('dream', 'msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'next-qa', 'egoschema', 'mvbench', 'tvbench', 'video-mme', 'msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa', 'favor-bench')")
    parser.add_argument('--sample_num', type=int, default=-1)
    parser.add_argument('--verbose', action='store_true')

    args = parser.parse_args()

    # Expand aliases / task-type groups (e.g. 'all', 'caption') into concrete names.
    args.benchmarks = get_benchmarks(args.benchmarks)
    print("### Selected Benchmarks:", args.benchmarks)

    # Metric (by name, resolved through METRIC_MAPPING) used to score each benchmark.
    Benchmark2Metric = {
        # Multi-choice QA
        'next-qa': 'AccuracyMetric',
        'egoschema': 'AccuracyMetric',
        'mvbench': 'AccuracyMetric',
        'tvbench': 'AccuracyMetric',
        'video-mme': 'VideoMMEAccuracyMetric',
        'favor-bench': 'AccuracyMetric',

        # Open-ended QA
        'msvd-qa': 'GPTMetric',
        'msr-vtt-qa': 'GPTMetric',
        'tgif-qa': 'GPTMetric',
        'anet-qa': 'GPTMetric',

        # Caption DREAM
        'dream': 'DREAMGPTMetric',

        # Caption CIDEr
        'msvd-caption': 'CIDErMetric',
        'msr-vtt-caption': 'CIDErMetric',
        'vatex-caption': 'CIDErMetric',
    }

    # Dataset-name prefix (as stored in the prediction records) for each benchmark.
    Benchmark2Dataset = {
        'dream': 'DREAM',

        'next-qa': 'Next-QA-val-multi_choice',
        'egoschema': 'EgoSchema',
        'mvbench': 'MVBench',
        'tvbench': 'TVBench',
        'video-mme': 'Video-MME',
        'favor-bench': 'FAVOR-Bench',

        'msvd-qa': 'MSVD-QA-val',
        'msr-vtt-qa': 'MSR-VTT-QA-val',
        'tgif-qa': 'TGIF-QA-test',
        'anet-qa': 'ActivityNet-QA-test',

        'msvd-caption': 'MSVD-Caption-test',
        'msr-vtt-caption': 'MSR-VTT-Caption-test',
        'vatex-caption': 'VATEX-test',
    }

    METRIC2DATASET = []

    for bm in args.benchmarks:
        if bm not in Benchmark2Metric:
            # Unknown names (passed through by get_benchmarks) are reported, not fatal.
            print(Color.red(f"Unknown benchmark: {bm}"))
            continue

        METRIC2DATASET.append([Benchmark2Metric[bm], Benchmark2Dataset[bm]])

    evaluate_all(args.pred_file, METRIC2DATASET, args.sample_num, args.verbose)

# python3 -m evaluation.evaluate --pred_file $pred_file --sample_num=100
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ from .evaluate_caption_cider import CIDErMetric
2
+ from .evaluate_qa_oe_gpt import GPTMetric
3
+ from .evaluate_qa_mc import AccuracyMetric
4
+ from .evaluate_dream_gpt import DREAMGPTMetric
5
+ from .evaluate_video_mme import VideoMMEAccuracyMetric
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_caption_cider.py ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import json
15
+ from typing import List, Dict
16
+ import os
17
+ from pycocoevalcap.cider.cider import Cider
18
+ import sys
19
+ sys.path.append('eval_scripts/DREAM-1K/tarsier')
20
+ from tools.ptbtokenizer import PTBTokenizer
21
+
22
+ from tools.color import Color
23
+
24
class CIDErMetric:
    """Computes CIDEr over (prediction, response) caption pairs.

    Captions are lowercased and PTB-tokenized before scoring; per-sample
    scores end up in self.results and the corpus score in self.score.
    """

    def __init__(self, dataset_name, verbose=False) -> None:
        self.dataset_name = dataset_name
        self.tokenizer = PTBTokenizer()
        self.scorer = Cider()
        self.score = None     # corpus-level CIDEr, set by process()
        self.results = []     # per-sample {'score', 'data'} records
        self.dataset = []     # samples collected via add()
        self.verbose = verbose

    def add(self, data):
        # Collect one sample; note process() takes its own dataset argument
        # and does not read self.dataset.
        self.dataset.append(data)

    def process(self, dataset: List[Dict]):
        """Tokenize references/predictions, then compute corpus + per-sample CIDEr."""
        references, predictions = {}, {}
        for i, data in enumerate(dataset):
            ref = data['response']
            pred = data['prediction']

            # A single reference string becomes a one-element list.
            if isinstance(ref, str):
                ref = [ref]

            references[i] = [{'caption': r.lower()} for r in ref]
            predictions[i] = [{'caption': pred.lower()}]

        references = self.tokenizer.tokenize(references)
        predictions = self.tokenizer.tokenize(predictions)
        score, scores = self.scorer.compute_score(references, predictions)
        self.score = score
        for data, s in zip(dataset, scores):
            self.results.append({
                'score': s,
                'data': data,
            })

    def summarize_metric(self):
        """Print (and record in self.eval_records) the evaluation summary."""
        if self.verbose:
            for result in self.results:
                print(Color.blue(json.dumps(result['data'])))
                print(Color.red(f"CIDEr score: {result['score']}"))
        print(f'=====Evaluation Summary=====')
        self.eval_records = [
            f'Dataset: {self.dataset_name}\tMetric: CIDEr',
            f'#Successful Results: {len(self.results)}',
            f'CIDEr score: {round(self.score*100, 1)}'
        ]
        for info in self.eval_records:
            print(info)

    def save_results(self, pred_path):
        """Write the summary lines next to the predictions, under eval_records/."""
        if os.path.isdir(pred_path):
            output_dir = os.path.join(pred_path, 'eval_records')
        else:
            output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
        os.makedirs(output_dir, exist_ok=True)
        fout = open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w')
        for info in self.eval_records:
            fout.write(info+'\n')
        fout.close()
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_dream_gpt.py ADDED
@@ -0,0 +1,436 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import json
15
+ import numpy as np
16
+ import ast
17
+ import time
18
+ from typing import List, Dict
19
+ from tqdm import tqdm
20
+ from pathos.multiprocessing import ProcessingPool as Pool
21
+ import func_timeout
22
+ from func_timeout import func_set_timeout
23
+
24
+ import sys
25
+ sys.path.append('eval_scripts/DREAM-1K/tarsier')
26
+ import re
27
+ import os
28
+ from copy import deepcopy
29
+ from traceback import format_exc
30
+
31
# Load the OpenAI API key from the working directory; fall back to an empty
# key so that import succeeds and API calls fail with an auth error instead.
try:
    with open("apikey.txt", "r") as f:
        # strip() removes the trailing newline most editors add, which would
        # otherwise corrupt the Authorization header.
        api_key = f.read().strip()
except OSError:  # narrowed from bare except: only file-access errors are expected
    api_key = ''
36
+
37
def call_gpt35(msg):
    """Call the gpt-3.5-turbo chat API with the given messages, retrying until success.

    Args:
        msg: list of chat messages ({'role': ..., 'content': ...} dicts).
    Returns:
        The content string of the first returned choice.
    """
    # Bug fix: `openai` was never imported anywhere in this module; import it
    # lazily so the dependency is only required when GPT evaluation runs.
    import openai

    while True:
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=msg,
                api_key=api_key,
                request_timeout=5)
            break
        except Exception:  # narrowed from bare except; retry on any API/timeout error
            print("Timeout, retrying...")
            time.sleep(5)

    output_text = response['choices'][0]['message']['content']
    return output_text
52
+
53
def count_f1(r, p):
    """Harmonic mean (F1) of recall r and precision p.

    Returns 0.0 when both are zero instead of raising ZeroDivisionError
    (the conventional F1 definition for an empty intersection).
    """
    if r + p == 0:
        return 0.0
    return 2*r*p/(r+p)
55
+
56
+
57
def call_azure_gpt_api(events, reference, prediction, model):
    """Ask GPT to judge each event against the predicted video description.

    For every event, the model labels the (description, event) pair as
    entailment / neutral / contradiction and returns the raw JSON text.

    NOTE(review): `model` is currently unused — call_gpt35 hard-codes
    gpt-3.5-turbo; confirm whether it should be forwarded.
    """
    # With no extracted events, fall back to the whole reference as one event.
    if len(events) == 0:
        events = [reference.replace('\n', ' ')]
    messages=[
        {
            "role": "user",
            "content":
            "Given a video description and a list of events. For each event, classify the relationship between the video description and the event into three classes: entailment, neutral, contradiction.\n"
            "- \"entailment\" means that the video description entails the event.\n"
            "- \"contradiction\" means that some detail in the video description contradicts with the event.\n"
            "- \"neutral\" means that the relationship is neither \"entailment\" or \"contradiction\".\n\n"
            f"Video Description:\n{prediction}\n\n"
            f"Events: {events}\n"

            "Output a JSON formed as:\n"
            "{\n"
            "  \"events\": [\n"
            "    {\"event\": \"copy an event here\", \"relationship\": \"put class name here\", \"reason\": \"give your reason here\"},\n"
            "    ...\n"
            "  ]\n"
            "}\n\n"
            "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only output the JSON. Output:"
        }
    ]

    completion = call_gpt35(messages)
    return completion
84
+
85
+
86
def call_azure_gpt_api_for_events(caption, model):
    """Ask GPT to extract up to 10 atomic action events from a caption.

    Returns the raw model output, expected to be a Python-dict-style string
    of the form {"events": [...]}.

    NOTE(review): `model` is currently unused — call_gpt35 hard-codes
    gpt-3.5-turbo; confirm whether it should be forwarded.
    """
    messages=[
        {
            "role": "user",
            "content":
            "Bellow is a description of a video clip:\n"
            f"Video Description: {caption}\n\n"

            "Extract at most 10 key events from the above video description paragraph. Requirements\n:"
            "- An event must include an action, motion or movement (NOT STATIC INFORMATION). DON'T repeat same events.\n"
            "- Every event is represented by a brief sentence within 10 words, with a subject, a predicate and optionally an object, avoid unnecessary appearance descriptions.\n"
            "- Every event must be atomic, meaning that it cannot be further split into multiple events.\n"
            "- Scene cuts and camera motions are NOT events.\n"
            "- Substitute pronouns by the nouns they refer to.\n\n"
            "Please generate the response in the form of a Python dictionary string with keys \"events\". The value of \"events\" is a List(str), of which each item is an event. "
            "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
            "For example, your response should look like this: {\"events\": [event1, event2, ...]}"
        }
    ]

    completion = call_gpt35(messages)
    return completion
108
+
109
+ def try_call_api_for_eval(events, answer, prediction, model, verbose=False, max_retry=100):
110
+ for i in range(max_retry):
111
+ gpt_q = call_azure_gpt_api(events, answer, prediction, model)
112
+ if gpt_q is not None:
113
+ gpt_q = gpt_q.strip()
114
+ gpt_q = re.sub(r'\n+', '\n', gpt_q)
115
+ gpt_q = re.sub(r'\s+', ' ', gpt_q)
116
+
117
+ if gpt_q.startswith("```json"):
118
+ gpt_q = gpt_q.replace("```json", "").replace("```", "").strip()
119
+ elif gpt_q.startswith("```python"):
120
+ gpt_q = gpt_q.replace("```python", "").replace("```", "").strip()
121
+ if not gpt_q.startswith('{'):
122
+ gpt_q = '{' + gpt_q
123
+ if not gpt_q.endswith('}'):
124
+ gpt_q = gpt_q + '}'
125
+ gpt_q = gpt_q.replace("True", "true").replace("False", "false")
126
+ gpt_q = gpt_q.replace("} {", "}, {").replace("}{", "}, {")
127
+ gpt_q = gpt_q.replace(",\n}", "\n}").replace(", \n}", "\n}").replace(", }", "}").replace(",}", "}")
128
+ gpt_q = gpt_q.replace(",\n]", "\n]").replace(", \n]", "\n]").replace(", ]", "]").replace(",]", "]")
129
+ gpt_q = gpt_q.replace("[Placeholder]", "null")
130
+ gpt_q = gpt_q.replace("{Events:", "").strip()
131
+
132
+ return gpt_q, True
133
+
134
+ return f"Exceed max try: {max_retry}", False
135
+
136
def try_call_api_for_events(caption, model, verbose=False, max_retry=100):
    """Call the event-extraction API with retries.

    Args:
        caption: video caption to decompose into events.
        model: judge model name (forwarded to the API helper).
        verbose: unused here; kept for interface compatibility with callers.
        max_retry: maximum number of API attempts (new parameter; defaults to
            the previously hard-coded 100, so callers are unaffected).

    Returns:
        (text, success) — on success, `text` is the raw completion with any
        Markdown code fences stripped; otherwise an error message.
    """
    for _ in range(max_retry):
        gpt_q = call_azure_gpt_api_for_events(caption, model)
        if gpt_q is not None:
            # Strip Markdown code fences the model sometimes wraps around JSON.
            if gpt_q.startswith("```json"):
                gpt_q = gpt_q.replace("```json", "").replace("```", "").strip()
            elif gpt_q.startswith("```python"):
                gpt_q = gpt_q.replace("```python", "").replace("```", "").strip()
            return gpt_q, True

    # Bug fix: the failure message previously said "Exceed max try: 5" even
    # though the loop retried 100 times (matches try_call_api_for_eval now).
    return f"Exceed max try: {max_retry}", False
147
+
148
def extract_events(inputs, is_pred=False, max_retry=100):
    """Extract atomic events from either the GT caption or the prediction.

    Args:
        inputs: tuple of (data dict, model name, verbose flag).
        is_pred: if True, extract from data['prediction'], else from
            data['response'].
        max_retry: outer retry budget; a negative value retries forever.

    Returns:
        List[str] of extracted events.

    Raises:
        ValueError: if no parseable event list was obtained within max_retry.
    """
    data, model, verbose = inputs
    if is_pred:
        caption = data['prediction'].lower()
    else:
        caption = data['response'].lower()
    # Double quotes would break the literal_eval of the model's reply.
    caption = caption.replace("\"", "\'")
    retry = 0
    while True and (retry<max_retry or max_retry<0):
        retry += 1
        result, success = try_call_api_for_events(caption, model, verbose)
        if not success:
            print(f"[error]: try_call_api_for_events failed!", flush=True)
            continue
        try:
            # The model replies with a Python-dict-like string, not strict JSON.
            result = ast.literal_eval(result)
            events = result['events']
            if verbose:
                print("pred_events=" if is_pred else "gt events=", events, ":", caption)
            assert isinstance(events, list) and (len(events)==0 or isinstance(events[0], str))
            return events
        except Exception as e:
            print(format_exc(), flush=True)
            continue
    print("[error]: Exceed max_retry!", flush=True)
    raise ValueError("[error]: Exceed max_retry!")
174
+
175
+
176
def evaluate_one_sample(events, response, prediction, model, verbose, return_hit_num=False, is_recall=False, max_retry=100):
    """Score how many of `events` are entailed by the `prediction` description.

    Args:
        events: reference events to classify.
        response: reference caption (forwarded to the judge prompt).
        prediction: caption the events are checked against.
        model: judge model name.
        verbose: forwarded to the API retry helper.
        return_hit_num: if True, also return the judged events and a hit string.
        is_recall: if True, only 'entailment' counts as a hit; otherwise
            'neutral' also counts.
        max_retry: outer retry budget; a negative value retries forever.

    Returns:
        motion_score (float in [0, 1]); or (motion_score, events_filled,
        "hit: k / n") when return_hit_num is True. An empty event list
        scores 1.0 by definition.

    Raises:
        ValueError: if no valid judge response was obtained within max_retry.
    """
    retry = 0
    while True and (retry<max_retry or max_retry<0):
        retry += 1
        try:
            assert isinstance(events, list)
            result = None
            result, success = try_call_api_for_eval(events, response, prediction, model, verbose)
            if not success:
                print("[error]: try_call_api_for_eval failed!", flush=True)
                continue
            try:
                events_filled = json.loads(result)
                events_filled = events_filled['events']
            except Exception as e:
                print("load json failed:", result)
                continue
            # The judge must return one entry per event (or a single entry
            # when there were no events at all).
            assert len(events) == len(events_filled) or (len(events) == 0 and len(events_filled) == 1)
            num_matched_events = 0
            try:
                for event in events_filled:
                    pred = event['relationship'].strip().lower()
                    assert pred in ['entailment', 'neutral', 'contradiction']
                    pos_classes = ['entailment'] if is_recall else ['entailment', 'neutral']
                    if pred in pos_classes:
                        num_matched_events += 1
            except Exception as e:
                print(f"Invalid response: {events_filled}")
                continue
            if len(events) == 0:
                motion_score = 1.0
            else:
                motion_score = num_matched_events / len(events)
            if return_hit_num:
                return motion_score, events_filled, f"hit: {num_matched_events} / {len(events)}"
            return motion_score
        except Exception as e:
            print(format_exc(), flush=True)
            continue
        # NOTE(review): unreachable — every path above returns or continues.
        time.sleep(1)
    print("[error]: Exceed max_retry!", flush=True)
    raise ValueError(f"[error]: Exceed max_retry!")
218
+
219
+
220
def process_one_sample(inputs):
    """Evaluate one DREAM-1K sample: event recall and precision via GPT.

    Args:
        inputs: tuple of (data dict, model name, verbose flag). The data dict
            needs 'response', 'prediction', 'idx', optionally precomputed
            'events' and 'extra_info'.

    Returns:
        {'success': bool, 'result': dict or None, 'data': data}. On success
        'result' carries 'score_r' (GT events entailed by the prediction),
        'score_p' (predicted events entailed by the GT) and 'eval_infos'.
    """
    data, model, verbose = inputs
    response, prediction = data['response'].lower(), data['prediction'].lower()
    result = None
    try:
        # Reuse pre-annotated GT events when present; otherwise extract them.
        if isinstance(data.get('events', None), list):
            gt_events = data['events']
        else:
            gt_events = extract_events(inputs, is_pred=False)
        pred_events = extract_events(inputs, is_pred=True)
        assert isinstance(gt_events, list) and isinstance(pred_events, list)
        result = {}
        # Recall: GT events checked against the prediction.
        motion_score_r, events_filled_r, hit_num_r = evaluate_one_sample(gt_events, response, prediction, model, verbose, return_hit_num=True, is_recall=True)
        # Precision: predicted events checked against the GT caption.
        motion_score_p, events_filled_p, hit_num_p = evaluate_one_sample(pred_events, prediction, response, model, verbose, return_hit_num=True, is_recall=True)
        result['score_r'] = motion_score_r
        result['score_p'] = motion_score_p
        result['eval_infos'] = {
            'idx': data['idx'],
            'gt': response,
            'pred': prediction,
            'events_gt': events_filled_r,
            'hit_num_recall': hit_num_r,
            'events_pred': events_filled_p,
            "hit_num_precision": hit_num_p,
        }
        if 'extra_info' in data:
            result['extra_info'] = data['extra_info']
    except Exception as e:
        if verbose:
            print(e)
            print(f'invalid GPT response: {result}')
        result = None
        return {'success': False, 'result': result, 'data': data}
    return {'success': True, 'result': result, 'data': data}
254
+
255
class DREAMGPTMetric:
    """GPT-judged DREAM-1K captioning metric.

    Computes event-level recall (GT events entailed by the prediction) and
    precision (predicted events entailed by the GT), reports F1 per subtask
    and additionally bucketed by #subjects / #shots / #events.
    """

    def __init__(self, dataset_name, verbose=False) -> None:
        self.dataset_name = dataset_name
        self.num_worker = 64  # size of the pathos process pool used in _process()
        # self.model = 'gpt-35-turbo'
        self.model = 'gpt-35-turbo-0125'
        # self.model='gpt-4-1106-preview'
        self.results = []          # successfully judged samples
        self.invalid_results = []  # samples that failed judging
        self.dataset = []          # raw samples added via add()
        self.verbose = verbose
        self.eval_infos = []       # per-sample details collected at summary time
        # Complexity buckets; keys are condition strings interpreted by
        # _match_bucket().
        self.buckets = {
            "subjects": {
                '<=1': [], '==2': [], '==3': [], '>=4': []
            },
            "shots": {
                '<=1': [], '==2': [], '==3': [], '>=4': []
            },
            "events": {
                '<=3': [], 'in [4, 5]': [], 'in [6, 7]': [], '>=8': []
            }
        }

    def add(self, data):
        """Queue one raw sample for later processing."""
        self.dataset.append(data)

    @staticmethod
    def _match_bucket(num, cond):
        """Return True if `num` satisfies a condition key like '<=1' or 'in [4, 5]'.

        Replaces the previous eval(f"{num}{cond}"): same behavior for all the
        keys declared in __init__, without executing arbitrary strings (the
        eval form also produces invalid syntax like "4in [4, 5]" for the
        membership keys).
        """
        cond = cond.strip()
        if cond.startswith('<='):
            return num <= int(cond[2:])
        if cond.startswith('>='):
            return num >= int(cond[2:])
        if cond.startswith('=='):
            return num == int(cond[2:])
        if cond.startswith('in'):
            return num in ast.literal_eval(cond[2:].strip())
        return False

    def select_bucket(self, bucket_name, num):
        """Return the first sub-bucket key of `bucket_name` matched by `num`, or ''."""
        for key in self.buckets[bucket_name]:
            if self._match_bucket(num, key):
                return key
        return ''

    def add_to_bucket(self, bucket_name, data):
        """File one judged result into its sub-bucket based on extra_info counts."""
        sub_bucket = self.select_bucket(bucket_name, data['result']['extra_info'][f'n_{bucket_name}'])
        if sub_bucket:
            self.buckets[bucket_name][sub_bucket].append(data)

    def process(self, dataset: List[Dict]):
        """Evaluate all samples, grouped by their source subtask."""
        self._process_group_by_subtask(dataset)

    def _process(self, dataset: List[Dict], subtask=None):
        """Judge one subtask's samples in parallel via a pathos process pool."""
        pool = Pool(processes = self.num_worker, )
        inputs = [(d, self.model, self.verbose) for d in dataset]
        # uimap yields results in completion order.
        results = pool.uimap(process_one_sample, inputs, chunksize = 1)

        for result in tqdm(results, total = len(dataset), desc=f'eval {subtask}'):
            if subtask:
                result['subtask'] = subtask
            self.update_metric(result)
        pool.close()
        pool.join()
        pool.clear()  # MUST: pathos caches pools between instantiations

    def _process_group_by_subtask(self, dataset: List[Dict]):
        """Partition `dataset` by its 'dataset' field and evaluate each group."""
        subtasks = {}
        for data in dataset:
            subtasks.setdefault(data['dataset'], []).append(data)
        for subtask, subdata in subtasks.items():
            self._process(subdata, subtask)

    def update_metric(self, result):
        """Route one judged sample into results or invalid_results."""
        if result['success']:
            self.results.append(result)
        else:
            self.invalid_results.append(result)

    def summarize_metric(self):
        """Print the per-subtask table, then the per-bucket tables."""
        self._summarize_metric_by_subtask()
        self._summarize_metric_by_bucket()

    def _summarize_metric_by_subtask(self):
        """Build self.table with per-subtask F1 / recall / precision rows."""
        from prettytable import PrettyTable
        self.table = PrettyTable(['Task', 'F1 Score', 'Action Recall', 'Action Precision', 'Success', 'Failed'])

        sub_results = {}
        sub_invalid_results = {}
        for data in self.results:
            sub_results.setdefault(data['subtask'], []).append(data)
        for data in self.invalid_results:
            sub_invalid_results.setdefault(data['subtask'], []).append(data)

        overall_avg_recall = []
        overall_avg_precision = []
        for subtask in sorted(sub_results.keys()):
            sub_rsts = sub_results[subtask]
            sub_in_rsts = sub_invalid_results.get(subtask, [])
            recalls = []
            precisions = []
            for result in sub_rsts:
                r, p, infos = result['result']['score_r'], result['result']['score_p'], result['result']['eval_infos']
                recalls.append(r)
                precisions.append(p)
                self.eval_infos.append(infos)
            avg_recall = np.average(recalls)
            avg_precision = np.average(precisions)
            f1 = count_f1(avg_recall, avg_precision)
            overall_avg_recall.append(avg_recall)
            overall_avg_precision.append(avg_precision)
            self.table.add_row([subtask, round(f1, 3), round(avg_recall, 3), round(avg_precision, 3), len(sub_rsts), len(sub_in_rsts)])
        # OVERALL row is the macro-average over subtasks.
        overall_recall = np.average(overall_avg_recall)
        overall_precision = np.average(overall_avg_precision)
        overall_f1 = count_f1(overall_recall, overall_precision)
        self.table.add_row(['OVERALL', round(overall_f1, 3), round(overall_recall, 3), round(overall_precision, 3), len(self.results), len(self.invalid_results)])
        print(f'=====DREAM Evaluation Summary=====')
        print(self.table)

    def _summarize_metric_by_bucket(self):
        """Build and print one Recall/Precision/F1 table per complexity bucket."""
        from prettytable import PrettyTable
        self.bucket_tables = []
        for bucket in self.buckets:
            table = PrettyTable(['Score'] + list(self.buckets[bucket].keys()))
            for data in self.results:
                self.add_to_bucket(bucket_name=bucket, data=data)
            bucket_result = {}
            for sub_bucket in self.buckets[bucket]:
                recalls = []
                precisions = []
                for result in self.buckets[bucket][sub_bucket]:
                    recalls.append(result['result']['score_r'])
                    precisions.append(result['result']['score_p'])
                avg_recall = np.average(recalls)
                avg_precision = np.average(precisions)
                f1 = count_f1(avg_recall, avg_precision)
                bucket_result[sub_bucket] = (avg_recall, avg_precision, f1)

            scores = ['Recall', 'Precision', 'F1']
            for i in range(len(scores)):
                row = [scores[i]]
                for sub_bucket in bucket_result:
                    row.append(round(bucket_result[sub_bucket][i], 3))
                table.add_row(row)
            sample_num = ['Count']
            for k in self.buckets[bucket]:
                sample_num.append(len(self.buckets[bucket][k]))
            table.add_row(sample_num)
            bucket_info = f'\n=====DREAM Evaluation Split by Bucket #{bucket}====='
            print(bucket_info)
            print(table)
            self.bucket_tables.append(bucket_info)
            self.bucket_tables.append(deepcopy(table))

    def save_results(self, pred_path):
        """Write the subtask table AND the bucket tables to an eval_records file."""
        if os.path.isdir(pred_path):
            output_dir = os.path.join(pred_path, 'eval_records')
        else:
            output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
        os.makedirs(output_dir, exist_ok=True)
        model_flag = os.path.basename(pred_path).split('.')[0]
        fout = open(os.path.join(output_dir, f'{self.dataset_name}_{model_flag}_eval_result.txt'), 'w')
        print(self.table, file=fout)
        for bucket_info in self.bucket_tables:
            # Bug fix: bucket tables were printed to stdout only and never
            # written to the results file (missing file=fout).
            print(bucket_info, file=fout)
        fout.close()

    def save_eval_infos(self, pred_path):
        """Dump per-sample evaluation details as JSONL next to the predictions."""
        if os.path.isdir(pred_path):
            output_dir = os.path.join(pred_path, 'eval_records')
        else:
            output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
        os.makedirs(output_dir, exist_ok=True)
        model_flag = os.path.basename(pred_path).split('.')[0]
        out_path = os.path.join(output_dir, f'DREAM_{model_flag}_eval_infos.jsonl')
        fout = open(out_path, 'w')
        for info in self.eval_infos:
            fout.write(json.dumps(info) +'\n')
        fout.close()
        # Bug fix: the logged path previously omitted the model flag and so
        # did not match the file actually written.
        print(f"DREAM evaluation information saved in: {out_path}", flush=True)
436
+
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_mc.py ADDED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import json
15
+ import numpy as np
16
+ import os
17
+ from typing import List, Dict
18
+ import sys
19
+ sys.path.append('eval_scripts/DREAM-1K/tarsier')
20
+ from tools.color import Color
21
+
22
+
23
class AccuracyMetric:
    """Exact-match accuracy for multiple-choice QA predictions.

    Reports a per-subtask table for MVBench / TVBench / FAVOR-Bench and a
    flat summary for every other dataset.
    """

    def __init__(self, dataset_name, verbose=False) -> None:
        self.dataset_name = dataset_name
        self.results = []          # samples whose prediction parsed to an option letter
        self.invalid_results = []  # samples whose prediction could not be parsed
        self.dataset = []          # raw samples added via add()
        self.verbose = verbose

    def add(self, data):
        """Queue one raw sample for later processing."""
        self.dataset.append(data)

    def process(self, dataset: List[Dict]):
        """Score `dataset`, grouping per subtask for multi-task benchmarks."""
        if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
            return self._process_group_by_subtask(dataset)
        else:
            return self._process(dataset)

    def _process(self, dataset: List[Dict], subtask=None):
        """Parse each prediction into an option letter and score exact match.

        A sample is 'successful' when the prediction parses to a single
        upper-case letter; digit options '0'-'5' are mapped onto 'A'-'F'.
        """
        for data in dataset:
            prompt, response, prediction = data['prompt'], data['response'], data['prediction']
            prediction = prediction.replace('(', '').replace(')', '').strip()
            # NOTE(review): an empty ground-truth response raises IndexError
            # here (original behavior preserved).
            response = response.replace('(', '').replace(')', '').strip()[0]
            if len(prediction) <= 0:
                success = False
            else:
                prediction = prediction[0]
                if '0'<=prediction<='5':
                    prediction = chr(int(prediction) + ord('A'))
                success = prediction.isupper() and prediction.isalpha() and len(prediction) == 1
            # The two branches previously built an identical dict; deduplicated.
            rst = {
                'success': success,
                'data': data,
                'result': {'acc': response == prediction}
            }
            if subtask:
                rst['subtask'] = subtask
            (self.results if success else self.invalid_results).append(rst)

    def _process_group_by_subtask(self, dataset: List[Dict]):
        """Partition `dataset` by its 'dataset' field and score each group."""
        subtasks = {}
        for data in dataset:
            subtasks.setdefault(data['dataset'], []).append(data)
        for subtask, subdata in subtasks.items():
            self._process(subdata, subtask)

    def summarize_metric(self):
        """Print either the per-subtask table or the flat summary."""
        if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
            return self._summarize_metric_by_subtask()
        else:
            return self._summarize_metric()

    def _summarize_metric(self):
        """Flat summary: average accuracy over successfully-parsed samples."""
        if self.verbose:
            for result in self.results + self.invalid_results:
                print(f"{Color.red('Success: ' + str(result['success']))}")
                print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
                print(f"{Color.green('Accuracy: ' + str(result['result']['acc']))}")

        accs = [result['result']['acc'] for result in self.results]
        avg_acc = np.average(accs)

        self.eval_records = [
            f'=====Evaluation Summary=====',
            f'Dataset: {self.dataset_name}\tMetric: Accuracy',
            f'#Successful Results: {len(self.results)}\n#Failed Results: {len(self.invalid_results)}',
            f'Accuracy: {round(avg_acc*100, 1)}',
        ]
        for info in self.eval_records:
            print(info)

    def _summarize_metric_by_subtask(self):
        """Per-subtask accuracy table; the OVERALL row is micro-averaged."""
        from prettytable import PrettyTable
        self.table = PrettyTable(['Task','Accuracy','Success','Failed'])

        sub_results = {}
        sub_invalid_results = {}
        for data in self.results:
            sub_results.setdefault(data['subtask'], []).append(data)
        for data in self.invalid_results:
            sub_invalid_results.setdefault(data['subtask'], []).append(data)

        oa_accs = []
        subtasks = sorted(sub_results.keys(), key=lambda x: x.split('/')[-1])
        for subtask in subtasks:
            sub_rsts = sub_results[subtask]
            sub_in_rsts = sub_invalid_results.get(subtask, [])
            accs = []
            for result in sub_rsts:
                acc = result['result']['acc']
                accs.append(acc)
                oa_accs.append(acc)
            avg_acc = np.average(accs)
            task_name = subtask.split('/')[-1]
            self.table.add_row([task_name, round(avg_acc*100, 1), len(sub_rsts), len(sub_in_rsts)])
        self.table.add_row(['OVERALL', round(np.average(oa_accs)*100, 1), len(self.results), len(self.invalid_results)])
        print(f'=====Evaluation Summary=====')
        print(self.table)

    def save_results(self, pred_path):
        """Write either the subtask table or the flat summary lines to disk."""
        if os.path.isdir(pred_path):
            output_dir = os.path.join(pred_path, 'eval_records')
        else:
            output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
        os.makedirs(output_dir, exist_ok=True)
        fout = open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w')
        if self.dataset_name in ['MVBench', 'TVBench', 'FAVOR-Bench']:
            print(self.table, file=fout)
        else:
            for info in self.eval_records:
                fout.write(info+'\n')
        fout.close()
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_qa_oe_gpt.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import json
15
+ import numpy as np
16
+ import ast
17
+ import time
18
+ from typing import List, Dict
19
+ from tqdm import tqdm
20
+ from pathos.multiprocessing import ProcessingPool as Pool
21
+ import func_timeout
22
+ from func_timeout import func_set_timeout
23
+ import os
24
+ import sys
25
+ sys.path.append('eval_scripts/DREAM-1K/tarsier')
26
+ from tools.color import Color
27
+
28
+
29
def call_azure_gpt_api(question, answer, prediction, model):
    """Ask the judge model to grade one open-ended QA prediction.

    Args:
        question: the question posed about the video.
        answer: ground-truth answer.
        prediction: model answer to be graded.
        model: judge model name. NOTE(review): unused — `call_gpt35` is
            invoked unconditionally; confirm whether `model` should be forwarded.

    Returns:
        The raw completion string (expected form: {'pred': 'yes'/'no',
        'score': 0-5}) or None if the API call failed.
    """
    messages=[
        {
            "role": "system",
            "content":
            "You are an intelligent chatbot designed for evaluating the correctness of generative outputs for question-answer pairs. "
            "Your task is to compare the predicted answer with the correct answer and determine if they match meaningfully. Here's how you can accomplish the task:"
            "------"
            "##INSTRUCTIONS: "
            "- Focus on the meaningful match between the predicted answer and the correct answer.\n"
            "- Consider synonyms or paraphrases as valid matches.\n"
            "- Evaluate the correctness of the prediction compared to the answer."
        },
        {
            "role": "user",
            "content":
            "Please evaluate the following video-based question-answer pair:\n\n"
            f"Question: {question}\n"
            f"Correct Answer: {answer}\n"
            f"Predicted Answer: {prediction}\n\n"
            "Provide your evaluation only as a yes/no and score where the score is an integer value between 0 and 5, with 5 indicating the highest meaningful match. "
            "Please generate the response in the form of a Python dictionary string with keys 'pred' and 'score', where value of 'pred' is a string of 'yes' or 'no' and value of 'score' is in INTEGER, not STRING."
            "DO NOT PROVIDE ANY OTHER OUTPUT TEXT OR EXPLANATION. Only provide the Python dictionary string. "
            "For example, your response should look like this: {'pred': 'yes', 'score': 4.8}"
        }
    ]
    completion = call_gpt35(messages)
    return completion
58
+
59
def try_call_api(question, answer, prediction, model, verbose=False):
    """Query the judge model, retrying up to five times on empty replies.

    Returns:
        (reply, success) — `reply` is None when every attempt fails.
    """
    attempts = 0
    while attempts < 5:
        attempts += 1
        reply = call_azure_gpt_api(question, answer, prediction, model)
        if reply is not None:
            return reply, True
    return None, False
66
+
67
def process_one_sample(inputs):
    """Judge one (question, answer, prediction) triple with GPT.

    Args:
        inputs: tuple of (data dict, model name, verbose flag). The data dict
            needs 'question', 'response' and 'prediction'.

    Returns:
        {'success': bool, 'result': parsed judge dict (or raw/None on
        failure), 'data': data}. A result counts as successful only when the
        judge reply parses to a dict with 'pred' in {'yes', 'no'} and a
        numeric 'score' within [0, 5].
    """
    data, model, verbose = inputs
    prompt, response, prediction = data['question'], data['response'].lower(), data['prediction'].lower()
    result = None
    try:
        result, success = try_call_api(prompt, response, prediction, model, verbose)
        if not success:
            raise ValueError(result)
        # The judge replies with a Python-dict-like string, not strict JSON.
        result = ast.literal_eval(result)
        pred, score = result['pred'], result['score']
        # check pred
        if pred not in ['yes', 'no']:
            raise ValueError()
        # check score
        result['score'] = float(result['score'])
        # NOTE(review): the range check compares the pre-conversion `score`;
        # a string score would raise TypeError here and count as invalid.
        if score < 0 or score > 5:
            raise ValueError()
    except Exception as e:
        if verbose:
            print(e)
            print(f'invalid GPT response: {result}')
        return {'success': False, 'result': result, 'data': data}
    return {'success': True, 'result': result, 'data': data}
90
+
91
class GPTMetric:
    """GPT-judged open-ended QA metric: accuracy ('yes'/'no') plus a 0-5 score."""

    def __init__(self, dataset_name, verbose=False) -> None:
        self.dataset_name = dataset_name
        self.num_worker = 64  # size of the pathos process pool used in process()
        # NOTE(review): call_azure_gpt_api calls call_gpt35 regardless of this
        # field; confirm whether the model name is actually honored.
        self.model = 'gpt-35-turbo-0125'
        self.results = []          # samples the judge scored successfully
        self.invalid_results = []  # samples with unusable judge output
        self.dataset = []          # raw samples added via add()
        self.verbose = verbose

    def add(self, data):
        """Queue one raw sample for later processing."""
        self.dataset.append(data)

    def process(self, dataset: List[Dict]):
        """Judge every sample in parallel via a pathos process pool."""
        pool = Pool(processes = self.num_worker, )
        inputs = [(d, self.model, self.verbose) for d in dataset]
        # uimap yields results in completion order.
        results = pool.uimap(process_one_sample, inputs, chunksize = 1)

        for result in tqdm(results, total = len(dataset)):
            self.update_metric(result)
        pool.close()
        pool.join()
        pool.clear() # MUST: pathos caches pools between instantiations

    def update_metric(self, result):
        """Route one judged sample into results or invalid_results."""
        if result['success']:
            self.results.append(result)
        else:
            self.invalid_results.append(result)

    def summarize_metric(self):
        """Print accuracy / average score and cache the lines in self.eval_records."""
        if self.verbose:
            for result in self.results + self.invalid_results:
                print(f"Success: {Color.red(str(result['success']))}")
                print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
                print(Color.green(json.dumps(result['result'], ensure_ascii=False)))
        preds, scores = [], []
        for result in self.results:
            pred, score = result['result']['pred'], result['result']['score']
            preds.append(pred)
            scores.append(score)
        avg_score = np.average(scores)
        acc = np.average([p == 'yes' for p in preds])
        print(f'=====Evaluation Summary=====')
        self.eval_records = [
            f'Dataset: {self.dataset_name}\tMetric: GPT Accuracy',
            f'#Successful Results: {len(self.results)}\n#Failed Results: {len(self.invalid_results)}',
            f'Accuracy: {round(acc*100, 1)}',
            f'Average Score: {round(avg_score, 3)}',
        ]
        for info in self.eval_records:
            print(info)

    def save_results(self, pred_path):
        """Write the cached summary lines into an eval_records directory
        placed next to the prediction file (or inside the prediction dir)."""
        if os.path.isdir(pred_path):
            output_dir = os.path.join(pred_path, 'eval_records')
        else:
            output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
        os.makedirs(output_dir, exist_ok=True)
        fout = open(os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt'), 'w')
        for info in self.eval_records:
            fout.write(info+'\n')
        fout.close()
eval_scripts/DREAM-1K/tarsier/evaluation/metrics/evaluate_video_mme.py ADDED
@@ -0,0 +1,358 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
import json
import os
import sys
from typing import Dict, List, Optional, Union

sys.path.append('eval_scripts/DREAM-1K/tarsier')
# Bug fix: this import was fused onto the end of the sys.path.append(...)
# line (a SyntaxError); it must be a separate statement, and it must come
# after the path append so `tools` is resolvable.
from tools.color import Color
20
+
21
# Video-MME top-level video domains (used to group per-category accuracy).
CATEGORIES = [
    "Knowledge",
    "Film & Television",
    "Sports Competition",
    "Artistic Performance",
    "Life Record",
    "Multilingual"
]

# Fine-grained sub-categories nested under the domains above.
SUB_CATEGORIES = [
    "Humanity & History",
    "Literature & Art",
    "Biology & Medicine",
    "Finance & Commerce",
    "Astronomy",
    "Geography",
    "Law",
    "Life Tip",
    "Technology",
    "Animation",
    "Movie & TV Show",
    "Documentary",
    "News Report",
    "Esports",
    "Basketball",
    "Football",
    "Athletics",
    "Other Sports",
    "Stage Play",
    "Magic Show",
    "Variety Show",
    "Acrobatics",
    "Handicraft",
    "Food",
    "Fashion",
    "Daily Life",
    "Travel",
    "Pet & Animal",
    "Exercise",
    "Multilingual"
]

# Question task types (perception / reasoning / synopsis splits).
TASK_CATEGORIES = [
    "Temporal Perception",
    "Spatial Perception",
    "Attribute Perception",
    "Action Recognition",
    "Object Recognition",
    "OCR Problems",
    "Counting Problem",
    "Temporal Reasoning",
    "Spatial Reasoning",
    "Action Reasoning",
    "Object Reasoning",
    "Information Synopsis",
]
77
+
78
+
79
+ class VideoMMEAccuracyMetric:
80
    def __init__(self, dataset_name, verbose=False) -> None:
        """Collect per-question Video-MME results and report accuracy."""
        self.dataset_name = dataset_name
        self.results = []          # every processed sample (valid and invalid)
        self.invalid_results = []  # samples whose prediction could not be parsed
        self.dataset = []          # raw samples added via add()
        self.verbose = verbose
86
+
87
    def add(self, data):
        """Queue one raw sample for later processing."""
        self.dataset.append(data)
89
+
90
    def process(self, dataset: List[Dict]):
        """Score every sample (Video-MME has no sub-task grouping step here)."""
        return self._process(dataset)
92
+
93
    def _process(self, dataset: List[Dict]):
        """Parse each prediction into an option letter and score exact match.

        Every sample is appended to self.results; samples whose prediction
        cannot be parsed into a single upper-case option letter are ALSO
        appended to self.invalid_results and flagged 'missing' (scored as
        wrong). The double bookkeeping is intentional: merge_results() needs
        every question present, while summarize_metric() reports the number
        of parse failures.
        """
        for data in dataset:
            prompt, response, prediction = data['prompt'], data['response'], data['prediction']
            extra_info = data['extra_info']
            # Strip parentheses so formats like "(A)" and "A)" both parse.
            prediction = prediction.replace('(', '').replace(')', '').strip()
            if len(prediction) <= 0:
                success = False
            else:
                prediction = prediction[0]
                # Map numeric options '1'-'5' onto letters.
                # NOTE(review): '1' maps to chr(1 + ord('A')) == 'B' — confirm
                # the intended numbering (the sister metric maps '0' -> 'A').
                if '1'<=prediction<='5':
                    prediction = chr(int(prediction) + ord('A'))
                success = prediction.isupper() and prediction.isalpha() and len(prediction) == 1
            if success:
                rst = {
                    'success': success,
                    'data': data,
                    'result': {'acc': response == prediction},
                    'extra_info': extra_info,
                    'missing': False
                }
                self.results.append(rst)
            else:
                rst = {
                    'success': success,
                    'data': data,
                    'result': {'acc': False},
                    'extra_info': extra_info,
                    'missing': True
                }
                self.results.append(rst)
                self.invalid_results.append(rst)
124
+
125
    def summarize_metric(self):
        """Print a summary and delegate detailed reporting to eval_your_results()."""
        if self.verbose:
            for result in self.results + self.invalid_results:
                print(f"{Color.red('Success: ' + str(result['success']))}")
                print(Color.blue(json.dumps(result['data'], ensure_ascii=False)))
                print(f"{Color.green('Accuracy: ' + str(result['result']['acc']))}")
        print(f'=====Evaluation Summary=====')
        print(f'Dataset: {self.dataset_name}\tMetric: Accuracy')
        # self.results holds every sample, so successful = total - invalid.
        print(f'#Successful Results: {len(self.results) - len(self.invalid_results)}\n#Failed Results: {len(self.invalid_results)}')
        self.eval_your_results(
            video_types = ["short","medium","long"],
            skip_missing = True,
            return_categories_accuracy = True,
            return_sub_categories_accuracy = False,
            return_task_types_accuracy = False,
        )
141
+
142
    def merge_results(self):
        """Regroup per-question results into per-video records.

        Returns:
            Dict mapping video_id -> record carrying the video's metadata
            (duration / domain / sub_category from extra_info), the list of
            its question results, and a 'missing' flag set to True if any of
            the video's predictions failed to parse.
        """
        results_merged_by_vid = {}
        for result in self.results:
            vid = result['extra_info']['vid']
            if vid not in results_merged_by_vid:
                results_merged_by_vid[vid] = {
                    'video_id': vid,
                    "duration": result['extra_info']['duration'],
                    "domain": result['extra_info']['domain'],
                    "sub_category": result['extra_info']['sub_category'],
                    'questions': [],
                    'missing': False
                }
            # One unparseable question marks the whole video as missing.
            if result['missing']:
                results_merged_by_vid[vid]['missing'] = True
            results_merged_by_vid[vid]['questions'].append({
                'qid': result['extra_info']['idx'],
                'task_type': result['extra_info']['task_type'],
                'acc': result['result']['acc']
            }
            )
        return results_merged_by_vid
164
+
165
def eval_your_results(
    self,
    video_types: Optional[Union[List[str], str]] = None,
    skip_missing: Optional[bool] = False,
    return_categories_accuracy: Optional[bool] = True,
    return_sub_categories_accuracy: Optional[bool] = False,
    return_task_types_accuracy: Optional[bool] = False,
    gt_answer_key: Optional[str] = "answer",
    your_answer_key: Optional[str] = "response"
):
    """
    Evaluate the merged per-video results, Video-MME style.

    Adapted from https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py

    Args:
        - video_types (Optional[List[str], str]): duration buckets to evaluate
          ("short" / "medium" / "long"), as a list or comma-separated string.
        - skip_missing (Optional[bool]): if True, videos flagged as missing are skipped;
          if False, each bucket is asserted to contain exactly 300 videos.
        - return_categories_accuracy (Optional[bool]): report per video-category accuracy.
        - return_sub_categories_accuracy (Optional[bool]): report per video-sub-category accuracy.
        - return_task_types_accuracy (Optional[bool]): report per task-category accuracy.
        - gt_answer_key / your_answer_key (Optional[str]): kept for interface compatibility
          with the upstream script; this implementation reads each question's
          pre-computed 'acc' field instead of re-parsing answers.
    """

    def _record(info):
        # Every summary line is both persisted (consumed by save_results) and printed.
        self.eval_records.append(info)
        print(info)

    your_results = list(self.merge_results().values())
    self.eval_records = []
    if isinstance(video_types, str):
        video_types = video_types.split(",")

    q_type_dict = {}
    v_type_dict = {}
    v_sub_type_dict = {}

    # ---- Accumulate correct/answered counters per duration bucket ----
    for video_type in video_types:
        # Filter results belonging to this duration bucket.
        your_results_video_type = [item for item in your_results if item['duration'] == video_type]

        q_type_dict[video_type] = {q: {"correct": 0, "answered": 0} for q in TASK_CATEGORIES}
        v_type_dict[video_type] = {v: {"correct": 0, "answered": 0} for v in CATEGORIES}
        v_sub_type_dict[video_type] = {v: {"correct": 0, "answered": 0} for v in SUB_CATEGORIES}

        if not skip_missing:
            # Video-MME ships exactly 300 videos per duration bucket.
            print(len(your_results_video_type))
            assert len(your_results_video_type) == 300, f"Number of files in {video_type} is not 300. Check if there are missing files."

        for item in your_results_video_type:
            if skip_missing and item["missing"]:
                continue

            video_category = item["domain"]
            video_sub_category = item["sub_category"]

            for question in item["questions"]:
                q_type = question["task_type"]
                acc = question['acc']  # pre-computed correctness for this question

                if acc is not None:
                    q_type_dict[video_type][q_type]["answered"] += 1
                    q_type_dict[video_type][q_type]["correct"] += acc

                    v_type_dict[video_type][video_category]["answered"] += 1
                    v_type_dict[video_type][video_category]["correct"] += acc

                    v_sub_type_dict[video_type][video_sub_category]["answered"] += 1
                    v_sub_type_dict[video_type][video_sub_category]["correct"] += acc

    # ---- Report per duration bucket ----
    for video_type in video_types:
        _record(f"=====================================\nEvaluation on video Type: {video_type}\n=====================================")
        if return_categories_accuracy:
            _record(f"-------------------------------------\nVideo Categories\n-------------------------------------")
            for v_type in v_type_dict[video_type]:
                _record(f"{v_type}: {100 * v_type_dict[video_type][v_type]['correct'] / v_type_dict[video_type][v_type]['answered'] if v_type_dict[video_type][v_type]['answered'] > 0 else 0 : .1f}%")
        if return_sub_categories_accuracy:
            # BUGFIX: this header was appended to eval_records but never printed,
            # unlike every sibling section header.
            _record(f"-------------------------------------\nVideo Sub Categories\n-------------------------------------")
            for v_sub_type in v_sub_type_dict[video_type]:
                _record(f"{v_sub_type}: {100 * v_sub_type_dict[video_type][v_sub_type]['correct'] / v_sub_type_dict[video_type][v_sub_type]['answered'] if v_sub_type_dict[video_type][v_sub_type]['answered'] > 0 else 0 : .1f}%")
        if return_task_types_accuracy:
            _record(f"-------------------------------------\nTask Categories\n-------------------------------------")
            for q_type in q_type_dict[video_type]:
                _record(f"{q_type}: {100 * q_type_dict[video_type][q_type]['correct'] / q_type_dict[video_type][q_type]['answered'] if q_type_dict[video_type][q_type]['answered'] > 0 else 0 : .1f}%")
        # BUGFIX: this header was printed but never appended to eval_records,
        # so the saved record file was missing it.
        _record(f"-------------------------------------\nOverall Performance\n-------------------------------------")
        total_correct = sum(q_type_dict[video_type][q_type]["correct"] for q_type in TASK_CATEGORIES)
        total_answered = sum(q_type_dict[video_type][q_type]["answered"] for q_type in TASK_CATEGORIES)
        _record(f"Overall: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
        print()  # trailing blank line, matching the original `print(info + '\n')`

    # ---- Report on the entire dataset ----
    _record(f"=====================================\nEvaluation on the entire dataset\n=====================================")

    if return_categories_accuracy:
        _record(f"-------------------------------------\nVideo Categories\n-------------------------------------")
        for v_type in CATEGORIES:
            total_correct = sum(v_type_dict[video_type][v_type]["correct"] for video_type in video_types)
            total_answered = sum(v_type_dict[video_type][v_type]["answered"] for video_type in video_types)
            _record(f"{v_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")

    if return_sub_categories_accuracy:
        _record(f"-------------------------------------\nVideo Sub Categories\n-------------------------------------")
        for v_sub_type in SUB_CATEGORIES:
            total_correct = sum(v_sub_type_dict[video_type][v_sub_type]["correct"] for video_type in video_types)
            total_answered = sum(v_sub_type_dict[video_type][v_sub_type]["answered"] for video_type in video_types)
            _record(f"{v_sub_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")

    if return_task_types_accuracy:
        _record(f"-------------------------------------\nTask Categories\n-------------------------------------")
        for q_type in TASK_CATEGORIES:
            total_correct = sum(q_type_dict[video_type][q_type]["correct"] for video_type in video_types)
            total_answered = sum(q_type_dict[video_type][q_type]["answered"] for video_type in video_types)
            _record(f"{q_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")

    _record(f"*************************************\nOverall Performance\n*************************************")
    total_correct = sum(sum(q_type_dict[video_type][q_type]["correct"] for q_type in TASK_CATEGORIES) for video_type in video_types)
    total_answered = sum(sum(q_type_dict[video_type][q_type]["answered"] for q_type in TASK_CATEGORIES) for video_type in video_types)
    _record(f"Overall: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
348
+
349
def save_results(self, pred_path):
    """Write the collected evaluation summary lines to
    `<dir>/eval_records/<dataset_name>_eval_result.txt`.

    Args:
        pred_path: prediction file or directory; the `eval_records` directory is
            created inside it (if a directory) or next to it (if a file).
    """
    if os.path.isdir(pred_path):
        output_dir = os.path.join(pred_path, 'eval_records')
    else:
        output_dir = os.path.join(os.path.dirname(pred_path), 'eval_records')
    os.makedirs(output_dir, exist_ok=True)
    out_file = os.path.join(output_dir, f'{self.dataset_name}_eval_result.txt')
    # Context manager guarantees the handle is closed even if a write fails
    # (the original opened/closed the file manually and could leak it).
    with open(out_file, 'w') as fout:
        for info in self.eval_records:
            fout.write(info + '\n')
eval_scripts/DREAM-1K/tarsier/models/modeling_qwen2_vl_fast.py ADDED
@@ -0,0 +1,1320 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import math
3
+ from dataclasses import dataclass
4
+ from typing import Any, Dict, List, Optional, Tuple, Union
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ from torch.nn import LayerNorm
10
+
11
+ from transformers.modeling_utils import PreTrainedModel
12
+ from transformers.configuration_utils import PretrainedConfig
13
+ from transformers.modeling_rope_utils import rope_config_validation, ROPE_INIT_FUNCTIONS
14
+ from transformers.cache_utils import Cache, SlidingWindowCache, StaticCache
15
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
16
+ from transformers.utils import (
17
+ add_start_docstrings,
18
+ add_start_docstrings_to_model_forward,
19
+ is_flash_attn_2_available,
20
+ is_flash_attn_greater_or_equal_2_10,
21
+ logging,
22
+ replace_return_docstrings,
23
+ )
24
+ from transformers.modeling_outputs import (
25
+ BaseModelOutputWithPast,
26
+ ModelOutput,
27
+ )
28
+ from transformers.activations import ACT2FN
29
+ from transformers.generation import GenerationMixin
30
+
31
+ if is_flash_attn_2_available():
32
+ from flash_attn import flash_attn_varlen_func
33
+
34
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
35
+ else:
36
+ flash_attn_varlen_func = None
37
+
38
+ # from apex.normalization.fused_layer_norm import fused_rms_norm_affine
39
+
40
+ logger = logging.get_logger(__name__)
41
+
42
@dataclass
class Qwen2VLCausalLMOutputWithPast(ModelOutput):
    """
    Base class for Qwen2VL causal language model (or autoregressive) outputs.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`).

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`. Attention weights after the softmax.

    NOTE(review): unlike the upstream Transformers class of the same name, this variant
    does NOT define a `rope_deltas` field — callers must not rely on it here.
    """

    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    past_key_values: Optional[List[torch.FloatTensor]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
78
+
79
class Qwen2VLVisionConfig(PretrainedConfig):
    """Configuration of the Qwen2-VL vision encoder (patch embedding, ViT trunk and merger)."""

    model_type = "qwen2_vl"

    def __init__(
        self,
        depth=32,
        embed_dim=1280,
        hidden_size=3584,
        hidden_act="quick_gelu",
        mlp_ratio=4,
        num_heads=16,
        in_channels=3,
        patch_size=14,
        spatial_merge_size=2,
        temporal_patch_size=2,
        attn_implementation='flash_attention_2',
        **kwargs,
    ):
        super().__init__(**kwargs)
        # Record the vision-tower hyper-parameters verbatim.
        (
            self.depth,
            self.embed_dim,
            self.hidden_size,
            self.hidden_act,
            self.mlp_ratio,
            self.num_heads,
            self.in_channels,
            self.patch_size,
            self.spatial_merge_size,
            self.temporal_patch_size,
            self.attn_implementation,
        ) = (
            depth,
            embed_dim,
            hidden_size,
            hidden_act,
            mlp_ratio,
            num_heads,
            in_channels,
            patch_size,
            spatial_merge_size,
            temporal_patch_size,
            attn_implementation,
        )

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """Load the vision sub-config, unwrapping it from a full qwen2_vl config when needed."""
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # A full multimodal checkpoint nests the vision settings under "vision_config".
        if config_dict.get("model_type") == "qwen2_vl":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)
127
+
128
+
129
class Qwen2VLConfig(PretrainedConfig):
    r"""
    Configuration class for [`Qwen2VLModel`]. Instantiating it with the defaults yields a
    configuration similar to Qwen2-VL-7B-Instruct
    [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the
    model outputs. Read the documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 152064): Vocabulary size of the model.
        hidden_size (`int`, *optional*, defaults to 8192): Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 29568): Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 80): Number of decoder layers.
        num_attention_heads (`int`, *optional*, defaults to 64): Attention heads per layer.
        num_key_value_heads (`int`, *optional*, defaults to 8): Key/value heads for Grouped
            Query Attention; equal to `num_attention_heads` gives MHA, `1` gives MQA. When
            `None`, falls back to `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): Decoder activation.
        max_position_embeddings (`int`, *optional*, defaults to 32768): Maximum sequence length.
        initializer_range (`float`, *optional*, defaults to 0.02): Std of the weight initializer.
        rms_norm_eps (`float`, *optional*, defaults to 1e-05): Epsilon of the RMS norm layers.
        use_cache (`bool`, *optional*, defaults to `True`): Whether to return key/value caches.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`): Tie input/output embeddings.
        rope_theta (`float`, *optional*, defaults to 1000000.0): Base period of the RoPE embeddings.
        use_sliding_window (`bool`, *optional*, defaults to `False`): Enable sliding-window attention.
        sliding_window (`int`, *optional*, defaults to 4096): Sliding window (SWA) size.
        max_window_layers (`int`, *optional*, defaults to 80): Number of bottom layers using SWA.
        attention_dropout (`float`, *optional*, defaults to 0.0): Attention dropout probability.
        rope_scaling (`Dict`, *optional*): RoPE scaling configuration; see the Transformers
            `rope_scaling` documentation for the supported `rope_type` variants
            ('default', 'linear', 'dynamic', 'yarn', 'longrope', 'llama3') and their keys.
        spatial_merge_size (`int`, *optional*, defaults to 2): Vision patch-merge factor.
        attn_implementation (`str`, *optional*, defaults to `'flash_attention_2'`): Attention backend.

    ```python
    >>> from transformers import Qwen2VLForConditionalGeneration, Qwen2VLConfig

    >>> # Initializing a Qwen2VL style configuration
    >>> configuration = Qwen2VLConfig()

    >>> # Initializing a model from the Qwen2-VL-7B style configuration
    >>> model = Qwen2VLForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "qwen2_vl"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=152064,
        hidden_size=8192,
        intermediate_size=29568,
        num_hidden_layers=80,
        num_attention_heads=64,
        num_key_value_heads=8,
        hidden_act="silu",
        max_position_embeddings=32768,
        initializer_range=0.02,
        rms_norm_eps=1e-05,
        use_cache=True,
        tie_word_embeddings=False,
        rope_theta=1000000.0,
        use_sliding_window=False,
        sliding_window=4096,
        max_window_layers=80,
        attention_dropout=0.0,
        rope_scaling=None,
        spatial_merge_size=2,
        attn_implementation='flash_attention_2',
        **kwargs,
    ):
        # Text-decoder shape parameters.
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # Sliding-window attention settings.
        self.use_sliding_window = use_sliding_window
        self.sliding_window = sliding_window
        self.max_window_layers = max_window_layers

        # Backward compatibility: old configs omit num_key_value_heads (pure MHA).
        self.num_key_value_heads = num_attention_heads if num_key_value_heads is None else num_key_value_heads

        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout
        self.rope_scaling = rope_scaling
        self.spatial_merge_size = spatial_merge_size
        self.attn_implementation = attn_implementation

        # Validate the rotary position embedding parameters.
        # BC: move a legacy 'type' key to 'rope_type'; 'mrope' performs default RoPE
        # calculations, so it is renamed to 'default' before validation. One can set
        # it to "linear"/"dynamic" etc. to have scaled RoPE.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            if self.rope_scaling["type"] == "mrope":
                self.rope_scaling["type"] = "default"
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self, ignore_keys={"mrope_section"})

        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
300
+
301
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    half = x.shape[-1] // 2
    first_half = x[..., :half]
    second_half = x[..., half:]
    return torch.cat((-second_half, first_half), dim=-1)
307
+
308
def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
    """Applies Rotary Position Embedding with Multimodal Sections to the query and key tensors (https://qwenlm.github.io/blog/qwen2-vl/).

    Explanation:
        Multimodal 3D rotary position embedding is an extension to 1D rotary position embedding. The input embedding
        sequence contains vision (images / videos) embedding and text embedding or just contains text embedding. For
        vision embedding part, we apply rotary position embedding on temporal, height and width dimension seperately.
        Here we split the channel dimension to 3 chunks for the temporal, height and width rotary position embedding.
        For text embedding part, we just apply 1D rotary position embedding. The three rotary position index (temporal,
        height and width) of text embedding is always the same, so the text embedding rotary position embedding has no
        difference with modern LLMs.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        mrope_section(`List(int)`):
            Multimodal rope section is for channel dimension of temporal, height and width in rope calculation.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos and sin so that they
            can be properly broadcasted to the dimensions of q and k. For example, if q and k have the shape
            [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos and sin broadcastable to
            the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then
            set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    # Repeating the section list makes the split spans cover the full last
    # dimension of cos/sin (which is twice the sum of the sections).
    mrope_section = mrope_section * 2
    # i % 3 cycles temporal -> height -> width, interleaving the three positional
    # components back into a single head_dim-wide cos/sin tensor.
    cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )
    sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(
        unsqueeze_dim
    )

    # Standard rotary application once the multimodal cos/sin are assembled.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
351
+
352
def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Apply 1D rotary position embedding to vision tokens.

    The rotation is computed in float32 for numerical stability, then cast back
    to the input dtype.
    """
    src_dtype = tensor.dtype
    x = tensor.float()
    # Duplicate each frequency across the paired rotary channels and add
    # batch/head broadcast dimensions.
    cos = freqs.cos().unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
    sin = freqs.sin().unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
    rotated = (x * cos) + (rotate_half(x) * sin)
    return rotated.to(src_dtype)
362
+
363
+
364
class VisionRotaryEmbedding(nn.Module):
    """Precompute rotary frequencies for vision token positions."""

    def __init__(self, dim: int, theta: float = 10000.0) -> None:
        super().__init__()
        exponents = torch.arange(0, dim, 2, dtype=torch.float) / dim
        # inv_freq[i] = 1 / theta^(2i/dim); not persisted in state_dict.
        self.register_buffer("inv_freq", 1.0 / (theta ** exponents), persistent=False)

    def forward(self, seqlen: int) -> torch.Tensor:
        # Outer product of positions and inverse frequencies: (seqlen, dim // 2).
        positions = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        return torch.outer(positions, self.inv_freq)
374
+
375
class PatchEmbed(nn.Module):
    """Embed flattened (temporal, height, width) patches with a single strided 3D conv."""

    def __init__(
        self,
        patch_size: int = 14,
        temporal_patch_size: int = 2,
        in_channels: int = 3,
        embed_dim: int = 1152,
    ) -> None:
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.in_channels = in_channels
        self.embed_dim = embed_dim

        # kernel == stride, so each conv application consumes exactly one patch.
        window = [temporal_patch_size, patch_size, patch_size]
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=window, stride=window, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        weight_dtype = self.proj.weight.dtype
        # Unflatten each row back into a (C, T, H, W) patch volume.
        patches = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
        )
        return self.proj(patches.to(dtype=weight_dtype)).view(-1, self.embed_dim)
399
+
400
+
401
class PatchMerger(nn.Module):
    """Merge spatial_merge_size**2 neighboring patch features and project them to `dim`."""

    def __init__(self, dim: int, context_dim: int, spatial_merge_size: int = 2) -> None:
        super().__init__()
        # Each merged token concatenates spatial_merge_size**2 patch features.
        self.hidden_size = context_dim * (spatial_merge_size**2)
        self.ln_q = LayerNorm(context_dim, eps=1e-6)
        self.mlp = nn.Sequential(
            nn.Linear(self.hidden_size, self.hidden_size),
            nn.GELU(),
            nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.ln_q(x)
        # Grouping happens implicitly: the flat view packs merge-group neighbors together.
        return self.mlp(normed.view(-1, self.hidden_size))
415
+
416
class VisionMlp(nn.Module):
    """Two-layer feed-forward block used inside each vision transformer block."""

    def __init__(self, dim: int, hidden_dim: int, hidden_act: str) -> None:
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Activation is looked up by name from the transformers registry.
        self.act = ACT2FN[hidden_act]
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x) -> torch.Tensor:
        hidden = self.fc1(x)
        hidden = self.act(hidden)
        return self.fc2(hidden)
426
+
427
class VisionAttention(nn.Module):
    """Eager (materialized-matrix) self-attention over packed vision sequences.

    Sequences from multiple images/videos are concatenated along dim 0;
    ``cu_seqlens`` holds their cumulative boundaries and is used to build a
    block-diagonal mask so tokens never attend across images.
    """

    def __init__(self, dim: int, num_heads: int = 16) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

    def forward(
        self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor, rotary_pos_emb: torch.Tensor = None
    ) -> torch.Tensor:
        seq_length = hidden_states.shape[0]
        # One projection, then split into (q, k, v), each (seq, heads, head_dim).
        packed = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1)
        q, k, v = packed.permute(1, 0, 2, 3).unbind(0)
        q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
        k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)

        # Start fully masked, then open each per-image diagonal block.
        attention_mask = torch.full(
            [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
        )
        for i in range(1, len(cu_seqlens)):
            start, end = cu_seqlens[i - 1], cu_seqlens[i]
            attention_mask[..., start:end, start:end] = 0

        q = q.transpose(0, 1)
        k = k.transpose(0, 1)
        v = v.transpose(0, 1)
        attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
        attn_weights = attn_weights + attention_mask
        # Softmax in float32 for stability, then cast back.
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.transpose(0, 1).reshape(seq_length, -1)
        return self.proj(attn_output)
462
class VisionFlashAttention2(nn.Module):
    """Vision self-attention backed by FlashAttention-2 varlen kernels.

    Packed sequences are delimited by `cu_seqlens`, so no explicit
    block-diagonal mask is built: `flash_attn_varlen_func` restricts
    attention to each segment directly.
    """

    def __init__(self, dim: int, num_heads: int = 16) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

    def forward(
        self, hidden_states: torch.Tensor, cu_seqlens: torch.Tensor, rotary_pos_emb: torch.Tensor = None
    ) -> torch.Tensor:
        seq_length = hidden_states.shape[0]
        # Project once, then split into (q, k, v), each (seq, heads, head_dim).
        q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
        q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
        k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)

        # Longest packed segment; required by the varlen flash kernel.
        max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
        attn_output = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen).reshape(
            seq_length, -1
        )
        attn_output = self.proj(attn_output)
        return attn_output
484
# Maps a vision attn_implementation string to the attention module class.
QWEN2_VL_VISION_ATTENTION_CLASSES = {
    "eager": VisionAttention,
    "flash_attention_2": VisionFlashAttention2,
}
490
class Qwen2VLVisionBlock(nn.Module):
    """One pre-norm transformer block of the vision tower (attention + MLP)."""

    def __init__(self, config, attn_implementation: str = "sdpa") -> None:
        super().__init__()
        self.norm1 = LayerNorm(config.embed_dim, eps=1e-6)
        self.norm2 = LayerNorm(config.embed_dim, eps=1e-6)
        mlp_hidden_dim = int(config.embed_dim * config.mlp_ratio)

        attn_cls = QWEN2_VL_VISION_ATTENTION_CLASSES[attn_implementation]
        self.attn = attn_cls(config.embed_dim, num_heads=config.num_heads)
        self.mlp = VisionMlp(dim=config.embed_dim, hidden_dim=mlp_hidden_dim, hidden_act=config.hidden_act)

    def forward(self, hidden_states, cu_seqlens, rotary_pos_emb) -> torch.Tensor:
        # Residual around attention, then residual around the MLP.
        attn_out = self.attn(self.norm1(hidden_states), cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
        hidden_states = hidden_states + attn_out
        return hidden_states + self.mlp(self.norm2(hidden_states))
509
class Qwen2VLPreTrainedModel(PreTrainedModel):
    """Base class wiring Qwen2-VL modules into the HF PreTrainedModel API."""

    config_class = Qwen2VLConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    # Keep decoder layers and vision blocks whole when sharding across devices.
    _no_split_modules = ["Qwen2VLDecoderLayer", "Qwen2VLVisionBlock"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_sdpa = False
    _supports_cache_class = True
    _supports_static_cache = True

    def _init_weights(self, module):
        """Initialize linear/conv/embedding weights with N(0, initializer_range)."""
        std = self.config.initializer_range
        if isinstance(module, (nn.Linear, nn.Conv3d)):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                # Padding embedding row stays zero.
                module.weight.data[module.padding_idx].zero_()
531
class Qwen2VisionTransformerPretrainedModel(Qwen2VLPreTrainedModel):
    """Vision tower of Qwen2-VL: patch embed -> transformer blocks -> merger.

    Consumes flattened patch pixel values plus a per-image/video (t, h, w)
    grid tensor and returns merged patch features for the language model.
    """

    config_class = Qwen2VLVisionConfig
    _no_split_modules = ["Qwen2VLVisionBlock"]

    def __init__(self, config) -> None:
        super().__init__(config)
        self.spatial_merge_size = config.spatial_merge_size

        self.patch_embed = PatchEmbed(
            patch_size=config.patch_size,
            temporal_patch_size=config.temporal_patch_size,
            in_channels=config.in_channels,
            embed_dim=config.embed_dim,
        )

        head_dim = config.embed_dim // config.num_heads
        # Rotary table spans half the head dim; the h and w position ids each
        # contribute half of the final embedding (see rot_pos_emb).
        self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)

        self.blocks = nn.ModuleList(
            [Qwen2VLVisionBlock(config, config.attn_implementation) for _ in range(config.depth)]
        )
        self.merger = PatchMerger(
            dim=config.hidden_size, context_dim=config.embed_dim, spatial_merge_size=config.spatial_merge_size
        )
        # Initialize weights and apply final processing
        self.gradient_checkpointing = False
        self.post_init()

    def get_dtype(self) -> torch.dtype:
        # A representative parameter stands in for the whole tower's dtype.
        return self.blocks[0].mlp.fc2.weight.dtype

    def get_device(self) -> torch.device:
        return self.blocks[0].mlp.fc2.weight.device

    def rot_pos_emb(self, grid_thw):
        """Build per-patch (h, w) rotary embeddings for each (t, h, w) grid.

        Position ids are emitted in spatial-merge-window order so they line
        up with how PatchMerger later groups neighboring patches.
        """
        pos_ids = []
        for t, h, w in grid_thw:
            hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
            hpos_ids = hpos_ids.reshape(
                h // self.spatial_merge_size,
                self.spatial_merge_size,
                w // self.spatial_merge_size,
                self.spatial_merge_size,
            )
            hpos_ids = hpos_ids.permute(0, 2, 1, 3)
            hpos_ids = hpos_ids.flatten()

            wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
            wpos_ids = wpos_ids.reshape(
                h // self.spatial_merge_size,
                self.spatial_merge_size,
                w // self.spatial_merge_size,
                self.spatial_merge_size,
            )
            wpos_ids = wpos_ids.permute(0, 2, 1, 3)
            wpos_ids = wpos_ids.flatten()
            # The same spatial ids repeat for every temporal slice.
            pos_ids.append(torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1))
        pos_ids = torch.cat(pos_ids, dim=0)
        max_grid_size = grid_thw[:, 1:].max()
        rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
        # Index by (h, w) ids -> (N, 2, head_dim // 4), flatten to (N, head_dim // 2).
        rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
        return rotary_pos_emb

    def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
        """Encode packed patches.

        Args:
            hidden_states: flattened patch pixel values, one row per patch.
            grid_thw: (num_images, 3) tensor of (t, h, w) patch-grid sizes.

        Returns:
            Merged patch features (``spatial_merge_size ** 2`` patches per row).
        """
        hidden_states = self.patch_embed(hidden_states)
        rotary_pos_emb = self.rot_pos_emb(grid_thw)

        # One attention segment per temporal slice: h*w patches, repeated t
        # times, then turned into cumulative boundaries with a leading zero.
        cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
            dim=0, dtype=torch.int32
        )
        cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)

        for blk in self.blocks:
            if self.gradient_checkpointing and self.training:
                hidden_states = self._gradient_checkpointing_func(
                    blk.__call__,
                    hidden_states,
                    cu_seqlens,
                    rotary_pos_emb,
                )
            else:
                hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)

        return self.merger(hidden_states)
616
+ # class Qwen2RMSNorm(nn.Module):
617
+ # def __init__(self, hidden_size, eps=1e-6):
618
+ # """
619
+ # Qwen2RMSNorm is equivalent to T5LayerNorm
620
+ # """
621
+ # super().__init__()
622
+ # self.weight = nn.Parameter(torch.ones(hidden_size))
623
+ # self.variance_epsilon = eps
624
+ # self.normalized_shape = torch.Size((hidden_size, ))
625
+
626
+ # def forward(self, hidden_states):
627
+ # return fused_rms_norm_affine(input=hidden_states,
628
+ # weight=self.weight,
629
+ # normalized_shape=self.normalized_shape,
630
+ # eps=self.variance_epsilon,
631
+ # memory_efficient=True)
632
+
633
class Qwen2RMSNorm(nn.Module):
    """Root-mean-square layer norm (equivalent to T5LayerNorm).

    Normalizes in float32 for stability, rescales by a learned weight, and
    casts back to the input dtype. No mean subtraction is performed.
    """

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        original_dtype = hidden_states.dtype
        x = hidden_states.to(torch.float32)
        mean_square = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(mean_square + self.variance_epsilon)
        return self.weight * x.to(original_dtype)

    def extra_repr(self):
        return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
652
class Qwen2VLRotaryEmbedding(nn.Module):
    """Multimodal rotary embedding (M-RoPE) for Qwen2-VL.

    Unlike standard RoPE, `position_ids` carries three components (temporal,
    height, width), so cos/sin are produced with a leading dimension of 3.
    """

    def __init__(
        self,
        dim=None,
        max_position_embeddings=2048,
        base=10000,
        device=None,
        scaling_factor=1.0,
        rope_type="default",
        config: Optional[Qwen2VLConfig] = None,
    ):
        super().__init__()
        # TODO (joao): remove the `if` below, only used for BC
        self.rope_kwargs = {}
        if config is None:
            logger.warning_once(
                "`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the "
                "`config` argument. All other arguments will be removed in v4.46"
            )
            self.rope_kwargs = {
                "rope_type": rope_type,
                "factor": scaling_factor,
                "dim": dim,
                "base": base,
                "max_position_embeddings": max_position_embeddings,
            }
            self.rope_type = rope_type
            self.max_seq_len_cached = max_position_embeddings
            self.original_max_seq_len = max_position_embeddings
        else:
            # BC: "rope_type" was originally "type"
            if config.rope_scaling is not None:
                self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
            else:
                self.rope_type = "default"
            self.max_seq_len_cached = config.max_position_embeddings
            self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        # Keep the unscaled table so a dynamic-scaling growth can be reset.
        self.original_inv_freq = self.inv_freq

    def _dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(
                self.config, device, seq_len=seq_len, **self.rope_kwargs
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @torch.no_grad()
    def forward(self, x, position_ids):
        # Move the 3 THW components to the front: the expanded shapes below
        # imply position_ids arrives as (bs, positions, 3) -> (3, bs, positions).
        position_ids = position_ids.permute(2, 0, 1)
        if "dynamic" in self.rope_type:
            self._dynamic_frequency_update(position_ids, device=x.device)

        # Core RoPE block. In contrast to other models, Qwen2_VL has different position ids for thw grids
        # So we expand the inv_freq to shape (3, ...)
        inv_freq_expanded = self.inv_freq[None, None, :, None].float().expand(3, position_ids.shape[1], -1, 1)
        position_ids_expanded = position_ids[:, :, None, :].float()  # shape (3, bs, 1, positions)
        # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()

        # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
        cos = cos * self.attention_scaling
        sin = sin * self.attention_scaling

        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
740
+ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2MLP
741
class Qwen2MLP(nn.Module):
    """Gated feed-forward network: down_proj(act(gate_proj(x)) * up_proj(x))."""

    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_state):
        gated = self.act_fn(self.gate_proj(hidden_state))
        return self.down_proj(gated * self.up_proj(hidden_state))
754
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
755
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    Repeat each key/value head `n_rep` times along the head dimension.

    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep):
    the input goes from (batch, num_key_value_heads, seqlen, head_dim) to
    (batch, num_attention_heads, seqlen, head_dim).
    """
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        # Plain MHA: nothing to expand.
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
766
class Qwen2VLAttention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """

    def __init__(self, config: Qwen2VLConfig, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        if layer_idx is None:
            # layer_idx is needed for per-layer KV-cache updates downstream.
            logger.warning_once(
                f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
                "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
                "when creating this class."
            )

        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        # GQA: queries use num_heads, keys/values use num_key_value_heads.
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.is_causal = True
        self.attention_dropout = config.attention_dropout
        self.rope_scaling = config.rope_scaling

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        # Biased QKV projections; bias-free output projection.
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
805
class Qwen2VLFlashAttention2(Qwen2VLAttention):
    """
    Qwen2VL flash attention module, following Qwen2VL attention module. This module inherits from `Qwen2VLAttention`
    as the weights of the module stays untouched. The only required change would be on the forward pass
    where it needs to correctly call the public API of flash attention and deal with padding tokens
    in case the input contains any of them. Additionally, for sliding window attention, we apply SWA only to the bottom
    config.max_window_layers layers.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
        # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
        use_rmpad: Optional[bool] = False,
        cu_seqlens: Optional[torch.Tensor] = False,
    ):
        """
        Train:
            unpad: (bsz, q_len) = (1, acc_seqlen)
            pad: (bsz, q_len) = (bsz, q_len)
        Test:
            first_iter: (bsz, q_len) = (bsz, q_len)
            other: (bsz, q_len) = (bsz, 1)
        """
        # NOTE(review): `cu_seqlens` defaults to `False` despite the tensor
        # type hint; `None` would be clearer — confirm before changing.
        bsz, q_len, _ = hidden_states.size()

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        # (bsz, seq, heads*head_dim) -> (bsz, heads, seq, head_dim)
        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        cos, sin = position_embeddings

        # Multimodal RoPE: cos/sin carry the 3 THW components, split per
        # `mrope_section` from the config's rope_scaling.
        query_states, key_states = apply_multimodal_rotary_pos_emb(
            query_states, key_states, cos, sin, self.rope_scaling["mrope_section"]
        )

        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)
        dropout_rate = 0.0 if not self.training else self.attention_dropout

        # In PEFT, usually we cast the layer norms in float32 for training stability reasons
        # therefore the input hidden states gets silently casted in float32. Hence, we need
        # cast them back in float16 just to be sure everything works as expected.
        input_dtype = query_states.dtype
        if input_dtype == torch.float32:
            if torch.is_autocast_enabled():
                target_dtype = torch.get_autocast_gpu_dtype()
            # Handle the case where the model is quantized
            elif hasattr(self.config, "_pre_quantization_dtype"):
                target_dtype = self.config._pre_quantization_dtype
            else:
                target_dtype = self.q_proj.weight.dtype

            logger.warning_once(
                f"The input hidden states seems to be silently casted in float32, this might be related to"
                f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
                f" {target_dtype}."
            )

            query_states = query_states.to(target_dtype)
            key_states = key_states.to(target_dtype)
            value_states = value_states.to(target_dtype)

        # Reashape to the expected shape for Flash Attention
        query_states = query_states.transpose(1, 2)
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)

        if use_rmpad:
            # Padding-free path: sequences are packed into a single row
            # (bsz == 1, see docstring) and delimited by cu_seqlens.
            max_seqlen = torch.max(cu_seqlens[1:] - cu_seqlens[:-1]).item()
            attn_output = flash_attn_varlen_func(
                query_states.squeeze(0), key_states.squeeze(0), value_states.squeeze(0),
                cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
                max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
                dropout_p=dropout_rate,
                causal=self.is_causal, window_size=(-1, -1),
            )
        else:
            attn_output = _flash_attention_forward(
                query_states, key_states, value_states,
                attention_mask,
                q_len,
                dropout=dropout_rate,
                sliding_window=None,
                is_causal=self.is_causal,
                use_top_left_mask=self._flash_attn_uses_top_left_mask,
            )

        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
        attn_output = self.o_proj(attn_output)

        # NOTE(review): `attn_weights` is only assigned on this branch, so
        # calling with output_attentions=True would raise NameError — flash
        # attention never materializes the attention matrix.
        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value
923
# Text-decoder attention dispatch; only flash-attention-2 is implemented here.
QWEN2_VL_ATTENTION_CLASSES = {
    "flash_attention_2": Qwen2VLFlashAttention2,
}
927
class Qwen2VLDecoderLayer(nn.Module):
    """One decoder layer: pre-norm self-attention followed by a pre-norm MLP."""

    def __init__(self, config: Qwen2VLConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        # Only flash_attention_2 is wired up below. Any other value is logged
        # (the message is Chinese for "only flash_attention_2 is supported!")
        # and then fails with a KeyError at the dict lookup.
        if config.attn_implementation != "flash_attention_2":
            logger.error(
                f"只支持 flash_attention_2!config.attn_implementation={config.attn_implementation}"
            )
        self.self_attn = QWEN2_VL_ATTENTION_CLASSES[config.attn_implementation](config, layer_idx)

        self.mlp = Qwen2MLP(config)
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
        use_rmpad: Optional[bool] = False,
        cu_seqlens: Optional[torch.Tensor] = False,
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
                Indices depicting the position of the input sequence tokens in the sequence.
            position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
                Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
                with `head_dim` being the embedding dimension of each attention head.
            kwargs (`dict`, *optional*):
                Arbitrary kwargs to be ignored, used for FSDP and other methods that injects code
                into the model
        """

        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            position_embeddings=position_embeddings,
            use_rmpad=use_rmpad,
            cu_seqlens=cu_seqlens,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
1011
class Qwen2VLModel(Qwen2VLPreTrainedModel):
    """Decoder-only transformer backbone of Qwen2-VL (text stack).

    Owns the token embedding table, the stack of decoder layers, the final
    RMSNorm, and the multimodal rotary embedding shared by all layers.
    """

    def __init__(self, config: Qwen2VLConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [Qwen2VLDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.rotary_emb = Qwen2VLRotaryEmbedding(config=config)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        use_rmpad: Optional[bool] = False,
        cu_seqlens: Optional[torch.Tensor] = False,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        """Run the decoder stack.

        Exactly one of `input_ids` / `inputs_embeds` must be provided; the
        multimodal caller typically passes pre-built `inputs_embeds` with
        vision features already merged in.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # Fix: the original assigned `hidden_states = inputs_embeds`
        # unconditionally, which crashed whenever callers passed `input_ids`
        # only — a combination the validation above explicitly allows.
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds

        # create position embeddings to be shared across the decoder layers
        position_embeddings = self.rotary_emb(hidden_states, position_ids)

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    attention_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    use_cache,
                    position_embeddings,
                    use_rmpad,
                    cu_seqlens,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    position_embeddings=position_embeddings,
                    use_rmpad=use_rmpad,
                    cu_seqlens=cu_seqlens,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                # Cache tuple position depends on whether attentions were emitted.
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None

        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )
1130
+ class Qwen2VLForCausalLM(Qwen2VLPreTrainedModel, GenerationMixin):
1131
+ _tied_weights_keys = ["lm_head.weight"]
1132
+
1133
    def __init__(self, config):
        """Build the Qwen2-VL backbone plus the vocabulary projection head."""
        super().__init__(config)
        self.model = Qwen2VLModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.padding_side = "left"  # set it to left by default, user can use setter to change padding_sides

        # Initialize weights and apply final processing
        self.post_init()
1142
    def get_input_embeddings(self):
        # The token embedding table is owned by the inner Qwen2VLModel.
        return self.model.embed_tokens
1145
    def set_input_embeddings(self, value):
        # Delegate to the inner model, which owns the embedding table.
        self.model.embed_tokens = value
1148
    def get_output_embeddings(self):
        # Output projection used for next-token logits.
        return self.lm_head
1151
    def set_output_embeddings(self, new_embeddings):
        # Replace the vocab projection head (e.g. after resizing embeddings).
        self.lm_head = new_embeddings
1154
    def set_decoder(self, decoder):
        # Swap out the decoder backbone wholesale.
        self.model = decoder
1157
    def get_decoder(self):
        # Expose the decoder backbone for HF utilities.
        return self.model
1160
+ def get_rope_index(
1161
+ self,
1162
+ input_ids: torch.LongTensor,
1163
+ image_grid_thw: Optional[torch.LongTensor] = None,
1164
+ attention_mask: Optional[torch.Tensor] = None,
1165
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
1166
+ """
1167
+ Calculate the 3D rope index based on image and video's temporal, height and width in LLM.
1168
+
1169
+ Explanation:
1170
+ Each embedding sequence contains vision embedding and text embedding or just contains text embedding.
1171
+
1172
+ For pure text embedding sequence, the rotary position embedding has no difference with mordern LLMs.
1173
+ Examples:
1174
+ input_ids: [T T T T T], here T is for text.
1175
+ temporal position_ids: [0, 1, 2, 3, 4]
1176
+ height position_ids: [0, 1, 2, 3, 4]
1177
+ width position_ids: [0, 1, 2, 3, 4]
1178
+
1179
+ For vision and text embedding sequence, we calculate 3D rotary position embedding for vision part
1180
+ and 1D rotary position embeddin for text part.
1181
+ Examples:
1182
+ Assume we have a video input with 3 temporal patches, 2 height patches and 2 width patches.
1183
+ input_ids: [V V V V V V V V V V V V T T T T T], here V is for vision.
1184
+ vision temporal position_ids: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
1185
+ vision height position_ids: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
1186
+ vision width position_ids: [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
1187
+ text temporal position_ids: [3, 4, 5, 6, 7]
1188
+ text height position_ids: [3, 4, 5, 6, 7]
1189
+ text width position_ids: [3, 4, 5, 6, 7]
1190
+ Here we calculate the text start position_ids as the max vision position_ids plus 1.
1191
+
1192
+ Args:
1193
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1194
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1195
+ it.
1196
+ image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
1197
+ The temporal, height and width of feature shape of each image in LLM.
1198
+ video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
1199
+ The temporal, height and width of feature shape of each video in LLM.
1200
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1201
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1202
+
1203
+ - 1 for tokens that are **not masked**,
1204
+ - 0 for tokens that are **masked**.
1205
+
1206
+ Returns:
1207
+ position_ids (`torch.LongTensor` of shape `(3, batch_size, sequence_length)`)
1208
+ mrope_position_deltas (`torch.Tensor` of shape `(batch_size)`)
1209
+ """
1210
+ spatial_merge_size = self.config.spatial_merge_size
1211
+ vision_token_id = self.config.image_token_id
1212
+ vision_start_token_id = self.config.vision_start_token_id
1213
+ assert image_grid_thw is not None # TODO:测试纯文本会不会卡住
1214
+ total_input_ids = input_ids
1215
+ position_ids = torch.ones(
1216
+ 3, input_ids.shape[0], input_ids.shape[1], dtype=input_ids.dtype, device=input_ids.device
1217
+ )
1218
+ vision_index = 0
1219
+ for i, input_ids in enumerate(total_input_ids):
1220
+ if attention_mask is not None:
1221
+ input_ids = input_ids[attention_mask[i] == 1]
1222
+ vision_start_indices = torch.argwhere(input_ids == vision_start_token_id).squeeze(1)
1223
+ vision_num = (input_ids[vision_start_indices + 1] == vision_token_id).sum()
1224
+ input_tokens = input_ids.tolist()
1225
+ llm_pos_ids_list: list = []
1226
+ st = 0
1227
+ remain_vision_num = vision_num
1228
+ for _ in range(vision_num):
1229
+ if vision_token_id in input_tokens and remain_vision_num > 0:
1230
+ ed_vision = input_tokens.index(vision_token_id, st)
1231
+ else:
1232
+ ed_vision = len(input_tokens) + 1
1233
+
1234
+ t, h, w = (
1235
+ image_grid_thw[vision_index][0],
1236
+ image_grid_thw[vision_index][1],
1237
+ image_grid_thw[vision_index][2],
1238
+ )
1239
+ vision_index += 1
1240
+ remain_vision_num -= 1
1241
+ ed = ed_vision
1242
+
1243
+ llm_grid_t, llm_grid_h, llm_grid_w = (
1244
+ t.item(),
1245
+ h.item() // spatial_merge_size,
1246
+ w.item() // spatial_merge_size,
1247
+ )
1248
+ text_len = ed - st
1249
+
1250
+ st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
1251
+ llm_pos_ids_list.append(torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx)
1252
+
1253
+ t_index = torch.arange(llm_grid_t).view(-1, 1).expand(-1, llm_grid_h * llm_grid_w).flatten()
1254
+ h_index = torch.arange(llm_grid_h).view(1, -1, 1).expand(llm_grid_t, -1, llm_grid_w).flatten()
1255
+ w_index = torch.arange(llm_grid_w).view(1, 1, -1).expand(llm_grid_t, llm_grid_h, -1).flatten()
1256
+ llm_pos_ids_list.append(torch.stack([t_index, h_index, w_index]) + text_len + st_idx)
1257
+ st = ed + llm_grid_t * llm_grid_h * llm_grid_w
1258
+
1259
+ if st < len(input_tokens):
1260
+ st_idx = llm_pos_ids_list[-1].max() + 1 if len(llm_pos_ids_list) > 0 else 0
1261
+ text_len = len(input_tokens) - st
1262
+ llm_pos_ids_list.append(torch.arange(text_len).view(1, -1).expand(3, -1) + st_idx)
1263
+
1264
+ llm_positions = torch.cat(llm_pos_ids_list, dim=1).reshape(3, -1)
1265
+ position_ids[..., i, attention_mask[i] == 1] = llm_positions.to(position_ids.device)
1266
+ position_ids = position_ids.permute(1, 2, 0)
1267
+ return position_ids
1268
+
1269
+ def forward(
1270
+ self,
1271
+ input_ids: torch.LongTensor = None,
1272
+ attention_mask: Optional[torch.Tensor] = None,
1273
+ position_ids: Optional[torch.LongTensor] = None,
1274
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1275
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1276
+ labels: Optional[torch.LongTensor] = None,
1277
+ use_cache: Optional[bool] = None,
1278
+ output_attentions: Optional[bool] = None,
1279
+ output_hidden_states: Optional[bool] = None,
1280
+ return_dict: Optional[bool] = None,
1281
+ use_rmpad: Optional[bool] = False,
1282
+ cu_seqlens: Optional[torch.Tensor] = False,
1283
+ ) -> Union[Tuple, Qwen2VLCausalLMOutputWithPast]:
1284
+
1285
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1286
+ output_hidden_states = (
1287
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1288
+ )
1289
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1290
+
1291
+
1292
+ outputs = self.model(
1293
+ input_ids=input_ids,
1294
+ attention_mask=attention_mask,
1295
+ position_ids=position_ids,
1296
+ past_key_values=past_key_values,
1297
+ inputs_embeds=inputs_embeds,
1298
+ use_cache=use_cache,
1299
+ output_attentions=output_attentions,
1300
+ output_hidden_states=output_hidden_states,
1301
+ return_dict=return_dict,
1302
+ use_rmpad=use_rmpad,
1303
+ cu_seqlens=cu_seqlens,
1304
+ )
1305
+
1306
+ hidden_states = outputs[0]
1307
+ logits = self.lm_head(hidden_states)
1308
+
1309
+ if not return_dict:
1310
+ output = (logits,) + outputs[1:]
1311
+ return output
1312
+
1313
+ return Qwen2VLCausalLMOutputWithPast(
1314
+ logits=logits,
1315
+ past_key_values=outputs.past_key_values,
1316
+ hidden_states=outputs.hidden_states,
1317
+ attentions=outputs.attentions,
1318
+ )
1319
+
1320
+
eval_scripts/DREAM-1K/tarsier/models/modeling_tarsier.py ADDED
@@ -0,0 +1,502 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from dataclasses import dataclass
2
+ from typing import List, Optional, Tuple, Union, Dict, Any
3
+ import math
4
+
5
+ import torch.utils.checkpoint
6
+ from torch import nn
7
+ import torch.nn.functional as F
8
+
9
+ from transformers import PreTrainedModel, AutoConfig, AutoModel
10
+ from transformers.activations import ACT2FN
11
+ from transformers.cache_utils import Cache
12
+ from transformers.modeling_outputs import ModelOutput
13
+ from transformers.utils import logging
14
+ from transformers.configuration_utils import PretrainedConfig
15
+ from transformers.dynamic_module_utils import get_class_from_dynamic_module
16
+ from transformers.models.auto import AutoModel, AutoModelForCausalLM, CONFIG_MAPPING
17
+ from transformers.generation import GenerationMixin
18
+
19
+ from transformers import LlamaForCausalLM, Qwen2ForCausalLM
20
+ # from models.modeling_qwen2 import Qwen2ForCausalLM
21
+ from models.modeling_qwen2_vl_fast import Qwen2VLForCausalLM
22
+ from models.utils import _pad_input, _unpad_input
23
+
24
+ logger = logging.get_logger(__name__)
25
+
26
+
27
class LlavaConfig(PretrainedConfig):
    """Composite configuration for Tarsier/LLaVA-style models.

    Holds a vision-tower config, a text (LLM) config, and the projector /
    special-token settings that glue them together. Sub-configs given as dicts
    are resolved to concrete config classes (via ``auto_map`` or
    ``CONFIG_MAPPING``).
    """

    model_type = "llava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        image_newline_idx=32002,
        image_new_idx=32003,
        projection_head="MLP",
        **kwargs,
    ):
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        self.image_newline_idx = image_newline_idx
        self.image_new_idx = image_new_idx
        self.projection_head = projection_head

        # The vision/text resolution logic was duplicated; factored into one helper.
        self.vision_config = self._resolve_sub_config(
            vision_config, "clip_vision_model", "vision_config", **kwargs
        )
        self.text_config = self._resolve_sub_config(
            text_config, "llama", "text_config", **kwargs
        )

        super().__init__(**kwargs)

    @staticmethod
    def _resolve_sub_config(cfg, default_model_type, name, **kwargs):
        """Turn a dict sub-config into a config object; pass through otherwise.

        Resolution order: ``auto_map``-declared dynamic class, then
        ``CONFIG_MAPPING`` by ``model_type``; raises ValueError for unknown types.
        """
        if not isinstance(cfg, dict):
            # Already a config object (or None) — leave untouched.
            return cfg
        cfg["model_type"] = cfg["model_type"] if "model_type" in cfg else default_model_type
        if 'auto_map' in cfg:
            repo_id, class_ref = cfg['auto_map']['AutoConfig'].split("--")
            config_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs)
            return config_class(**cfg)
        if cfg["model_type"] in CONFIG_MAPPING:
            return CONFIG_MAPPING[cfg["model_type"]](**cfg)
        raise ValueError(f'{name}["model_type"] = {cfg["model_type"]} not supported!')
85
+
86
+
87
+
88
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Llava
class LlavaCausalLMOutputWithPast(ModelOutput):
    """Output container for TarsierForConditionalGeneration.forward.

    Mirrors the usual causal-LM output and additionally carries the
    ``position_ids`` that were used, so cached generation can continue the
    position sequence (consumed by ``_update_model_kwargs_for_generation``).
    """

    # Language-modeling loss (present only when labels were provided).
    loss: Optional[torch.FloatTensor] = None
    # Prediction scores over the vocabulary.
    logits: torch.FloatTensor = None
    # Key/value cache for fast autoregressive decoding.
    past_key_values: Optional[List[torch.FloatTensor]] = None
    # Per-layer hidden states, when requested.
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    # Per-layer attention weights, when requested.
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    # Position ids used for this forward pass.
    position_ids: Optional[torch.LongTensor] = None
98
+
99
def add_split_tokens(image_features, image_newline_embed, image_new_embed):
    """Append separator embeddings to per-image patch features.

    Inserts a "newline" embedding at the end of every patch row and an
    "image end" embedding after the last row, mirroring how the text side
    delimits image content. Assumes the patch count is a perfect square
    (square patch grid).

    Args:
        image_features: tensor of shape (num_images, num_patches, dim).
        image_newline_embed: (dim,) embedding appended per row.
        image_new_embed: (dim,) embedding appended per image.

    Returns:
        Tensor of shape (num_images, num_patches + side + 1, dim).
    """
    num_images, num_patches, dim = image_features.shape
    side = int(math.sqrt(num_patches))

    # Reshape to the 2D grid and append one newline embedding per row.
    grid = image_features.view(num_images, side, side, dim)
    newline_col = image_newline_embed.expand((num_images, side, 1, dim))
    grid = torch.cat([grid, newline_col], dim=2)
    flat = grid.view(num_images, num_patches + side, dim)

    # Terminate every image with the "image end" embedding.
    end_tok = image_new_embed.expand((num_images, 1, dim))
    return torch.cat([flat, end_tok], dim=1)
119
+
120
+
121
class LlavaMultiModalProjector(nn.Module):
    """Two-layer MLP projector from vision-tower features to LLM embeddings.

    Selects one hidden layer of the vision tower, optionally drops the CLS
    token, projects through Linear-act-Linear, and appends the newline /
    image-end separator embeddings via ``add_split_tokens``.
    """

    def __init__(self, config: LlavaConfig):
        super().__init__()
        self.config = config

        # Attribute names are load-bearing: they are checkpoint keys.
        self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.act = ACT2FN[config.projector_hidden_act]
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

        # Separator token ids as non-persistent buffers so they follow the
        # module across devices without being saved in the state dict.
        self.register_buffer(
            'image_newline_idx', torch.tensor([config.image_newline_idx], dtype=torch.long), persistent=False
        )
        self.register_buffer(
            'image_new_idx', torch.tensor([config.image_new_idx], dtype=torch.long), persistent=False
        )

    def forward(self, image_features, input_embeddings):
        feats = image_features[self.config.vision_feature_layer]

        strategy = self.config.vision_feature_select_strategy
        if strategy == "default":
            feats = feats[:, 1:]  # drop the CLS token
        elif strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {strategy}"
            )

        projected = self.linear_2(self.act(self.linear_1(feats)))

        newline_embed = input_embeddings(self.image_newline_idx).squeeze()
        image_end_embed = input_embeddings(self.image_new_idx).squeeze()
        return add_split_tokens(projected, newline_embed, image_end_embed)
157
+
158
class PixelShuffleMultiModalProjector(nn.Module):
    """Projector that pixel-shuffles vision features before an MLP.

    With the fixed downsample ratio of 0.5 the token count shrinks 4x while
    the channel dimension grows 4x; the MLP then maps into the LLM embedding
    space, and separator embeddings are appended via ``add_split_tokens``.
    """

    def __init__(self, config: LlavaConfig):
        super().__init__()
        self.config = config

        self.downsample_ratio = 0.5
        vit_hidden_size = config.vision_config.hidden_size
        llm_hidden_size = config.text_config.hidden_size
        # Channels after shuffling: c * (1/ratio)^2.
        shuffled_dim = vit_hidden_size * int(1 / self.downsample_ratio) ** 2

        self.mlp = nn.Sequential(
            nn.LayerNorm(shuffled_dim),
            nn.Linear(shuffled_dim, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size)
        )

        # Separator token ids as non-persistent buffers.
        self.register_buffer(
            'image_newline_idx', torch.tensor([config.image_newline_idx], dtype=torch.long), persistent=False
        )
        self.register_buffer(
            'image_new_idx', torch.tensor([config.image_new_idx], dtype=torch.long), persistent=False
        )

    def forward(self, image_features, input_embeddings):
        feats = image_features[self.config.vision_feature_layer]

        strategy = self.config.vision_feature_select_strategy
        if strategy == "default":
            feats = feats[:, 1:]  # drop the CLS token
        elif strategy != "full":
            raise ValueError(
                f"Unexpected select feature strategy: {strategy}"
            )

        shuffled = self.pixel_shuffle(feats)
        projected = self.mlp(shuffled)

        newline_embed = input_embeddings(self.image_newline_idx).squeeze()
        image_end_embed = input_embeddings(self.image_new_idx).squeeze()
        return add_split_tokens(projected, newline_embed, image_end_embed)

    def pixel_shuffle(self, x, scale_factor=0.5):
        """Trade spatial resolution for channels.

        (n, wh, c) -> (n, wh * s^2, c / s^2) for s = scale_factor.
        Assumes a square token grid. NOTE(review): after the first view the
        names h and w are swapped relative to the reshape above; this is
        harmless only because h == w here — confirm before supporting
        non-square grids.
        """
        if scale_factor == 1:
            return x
        n, wh, c = x.shape
        side = int(math.sqrt(wh))
        x = x.view(n, side, side, c)

        n, w, h, c = x.size()
        # N, W, H, C --> N, W, H * scale, C // scale
        x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
        # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
        x = x.permute(0, 2, 1, 3).contiguous()
        # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
        x = x.view(n, int(h * scale_factor), int(w * scale_factor),
                   int(c / (scale_factor * scale_factor)))
        x = x.permute(0, 2, 1, 3).contiguous()
        return x.view(x.shape[0], -1, x.shape[-1])
218
+
219
+
220
# Shared docstring fragment describing PreTrainedModel inheritance, kept for
# parity with upstream transformers model files.
LLAVA_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
    and behavior.

    Parameters:
        config ([`LlavaConfig`] or [`LlavaVisionConfig`]):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
235
+
236
class TarsierPreTrainedModel(PreTrainedModel):
    """Base class wiring Tarsier models into the HF PreTrainedModel machinery."""

    config_class = LlavaConfig
    base_model_prefix = "llm"
    supports_gradient_checkpointing = True  # TODO: support latest gc
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_sdpa = False
    _supports_cache_class = True  # TODO: support different cache
    _supports_static_cache = True

    def _init_weights(self, module):
        """Normal-init weights; zero biases and padding rows."""
        # Fall back to the text config when the top-level config has no range.
        if hasattr(self.config, "initializer_range"):
            std = self.config.initializer_range
        else:
            std = self.config.text_config.initializer_range

        if hasattr(module, "class_embedding"):
            module.class_embedding.data.normal_(mean=0.0, std=std)

        if isinstance(module, (nn.Linear, nn.Conv2d, nn.Conv3d)):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.weight.data.fill_(1.0)
            if module.bias is not None:
                module.bias.data.zero_()

    @property
    def _no_split_modules(self):
        # Combine the no-split lists of both sub-models for device placement.
        return self.language_model._no_split_modules + self.vision_tower._no_split_modules
271
+
272
+
273
class TarsierForConditionalGeneration(TarsierPreTrainedModel, GenerationMixin):
    """Tarsier vision-language model: vision tower + projector + language model.

    The vision tower encodes pixel inputs, a projector maps them into the LLM
    embedding space, and the features are scattered into the token embedding
    sequence at image-token positions before running the language model.
    """

    def __init__(self, config: LlavaConfig):
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config, trust_remote_code=True)
        if config.text_config.model_type == 'qwen2':
            self.language_model = Qwen2ForCausalLM(config.text_config)
        elif config.text_config.model_type == 'qwen2_vl':
            self.language_model = Qwen2VLForCausalLM(config.text_config)
        elif config.text_config.model_type == 'llama':
            self.language_model = LlamaForCausalLM(config.text_config)
        else:
            raise ValueError(f'{config.text_config.model_type} not supported!')

        if config.projection_head == 'Pixel_Shuffle':
            self.multi_modal_projector = PixelShuffleMultiModalProjector(config)
        elif config.projection_head == 'MLP':
            self.multi_modal_projector = LlavaMultiModalProjector(config)
        elif config.projection_head == 'auto_map':
            repo_id, class_ref = config.auto_map['ProjectionLayer'].split("--")
            model_class = get_class_from_dynamic_module(class_ref, repo_id)
            self.multi_modal_projector = model_class(config)
        elif config.projection_head is None:
            # Identity projection: vision features are used as-is.
            self.multi_modal_projector = lambda x, *args, **kwargs: x
        else:
            # Robustness fix: previously an unknown value silently left
            # multi_modal_projector undefined, failing later with AttributeError.
            raise ValueError(f'projection_head = {config.projection_head} not supported!')

        self.post_init()

    def get_input_embeddings(self):
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, new_embeddings):
        self.language_model.set_output_embeddings(new_embeddings)

    def set_decoder(self, decoder):
        self.language_model.set_decoder(decoder)

    def get_decoder(self):
        return self.language_model.get_decoder()

    def tie_weights(self):
        return self.language_model.tie_weights()

    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
        model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        # update vocab size
        self.config.text_config.vocab_size = model_embeds.num_embeddings
        return model_embeds

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        pixel_values: torch.FloatTensor = None,
        image_grid_thw: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        labels: Optional[torch.LongTensor] = None,
        num_images: Optional[torch.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        use_rmpad: Optional[bool] = False,
        **kwargs,
    ) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
        """Fuse vision features into token embeddings and run the LLM.

        When ``labels`` are given a shifted cross-entropy loss is computed;
        with ``use_rmpad`` the batch is flattened to a single unpadded
        sequence (FlashAttention varlen style) before the language model.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is None:
            raise ValueError("You must specify input_ids")

        bsz, max_seq_len = input_ids.shape[0], input_ids.shape[1]

        if max_seq_len > 1:
            special_image_mask = input_ids == self.config.image_token_index
            # Fixed: was a bare print() on every forward (and crashed when
            # num_images was None); now a lazily-formatted debug log.
            if num_images is not None:
                logger.debug(
                    '[%s] num_images: %s num_image_tokens: %s',
                    input_ids.device, num_images.tolist(), special_image_mask.sum(-1).tolist()
                )

        if position_ids is None:
            if 'Qwen2VLForCausalLM' in self.language_model.__class__.__name__:
                position_ids = self.language_model.get_rope_index(input_ids, image_grid_thw, attention_mask)  # [bsz, seqlen, 3]
            else:
                position_ids = attention_mask.long().cumsum(-1) - 1  # [bsz, seqlen]
                position_ids.masked_fill_(attention_mask == 0, 1)

        if use_rmpad:
            # Flatten [bsz, seqlen] -> [1, total_tokens] dropping padding.
            input_ids, input_ids_indices, cu_seqlens, _ = _unpad_input(input_ids, attention_mask)
            position_ids, _, _, _ = _unpad_input(position_ids, attention_mask)
            input_ids, position_ids = input_ids.unsqueeze(0), position_ids.unsqueeze(0)
        else:
            input_ids_indices, cu_seqlens = None, None

        inputs_embeds = self.get_input_embeddings()(input_ids)  # [1, seqlen, dim]

        image_features = None
        if pixel_values is not None:  # training / first step in generation
            if 'Qwen2VLForCausalLM' in self.language_model.__class__.__name__:
                pixel_values = pixel_values.type(self.vision_tower.get_dtype())
                image_features = self.vision_tower(pixel_values, image_grid_thw)
            else:
                image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
                image_features = self.multi_modal_projector(
                    image_outputs.hidden_states,
                    self.get_input_embeddings(),
                )

            special_image_mask = (input_ids == self.config.image_token_index).to(inputs_embeds.device)
            if special_image_mask.sum() > 0:
                image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
                inputs_embeds = inputs_embeds.masked_scatter(
                    special_image_mask.unsqueeze(-1).expand_as(inputs_embeds),
                    image_features
                )
            else:
                # No image tokens in this batch: add a zero contribution so the
                # vision tower still participates in the autograd graph.
                inputs_embeds = image_features.sum(dim=(0, 1)) * 0. + inputs_embeds

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            use_rmpad=use_rmpad,
            cu_seqlens=cu_seqlens,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            if use_rmpad:
                labels = labels.view(-1)[input_ids_indices.long()]
                shift_labels = torch.cat((labels[1:], labels.new_ones((1)) * -100))
                shift_labels.requires_grad = False
                # Mask the last position of every packed sequence so we never
                # predict across sequence boundaries.
                lbl_seq_lens = (cu_seqlens[1:] - 1).long()
                shift_labels[lbl_seq_lens] = -100
                loss = loss_fct(logits.squeeze(0), shift_labels)
            else:
                # Shift so that tokens < n predict n
                shift_logits = logits[..., :-1, :].contiguous()
                shift_labels = labels[..., 1:].contiguous()
                # Flatten the tokens
                shift_logits = shift_logits.view(-1, self.config.text_config.vocab_size)
                shift_labels = shift_labels.view(-1)
                # Enable model parallelism
                shift_labels = shift_labels.to(shift_logits.device)
                loss = loss_fct(shift_logits, shift_labels)
        elif use_rmpad:
            # During training we skip re-padding the logits to save memory;
            # only pad them back when the caller actually consumes logits.
            logits = _pad_input(logits.squeeze(0), input_ids_indices, bsz, max_seq_len)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return LlavaCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            position_ids=position_ids,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        attention_mask=None,
        position_ids=None,
        past_key_values=None,
        cache_position=None,
        use_cache=True,
        pixel_values=None,
        image_grid_thw=None,
        **kwargs,
    ):
        """Trim cached tokens and route pixel inputs only on the first step."""
        if past_key_values is not None:
            past_length = past_key_values.get_seq_length()
            input_ids = input_ids[:, past_length:]

        model_inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "past_key_values": past_key_values,
            "use_cache": use_cache,
        }
        if kwargs.get('num_images') is not None:
            model_inputs['num_images'] = kwargs['num_images']

        if cache_position[0] == 0:
            # If we're in cached decoding stage, pixel values should be None because
            # input ids do not contain special image tokens anymore.
            # Otherwise we need pixel values to be passed to the model.
            model_inputs["pixel_values"] = pixel_values
            model_inputs["image_grid_thw"] = image_grid_thw
        else:
            # Continue the (possibly 3D) position sequence from the last step.
            model_inputs['position_ids'] = position_ids[:, -1, ...].unsqueeze(1).to(device=input_ids.device) + 1
        return model_inputs

    def _update_model_kwargs_for_generation(
        self,
        outputs: ModelOutput,
        model_kwargs: Dict[str, Any],
        is_encoder_decoder: bool = False,
        num_new_tokens: int = 1,
    ) -> Dict[str, Any]:
        """Propagate position_ids from the output into the next step's kwargs."""
        model_kwargs = super()._update_model_kwargs_for_generation(
            outputs=outputs,
            model_kwargs=model_kwargs,
            is_encoder_decoder=is_encoder_decoder,
            num_new_tokens=num_new_tokens,
        )

        if getattr(outputs, "position_ids", None) is not None:
            model_kwargs["position_ids"] = outputs.position_ids

        return model_kwargs
eval_scripts/DREAM-1K/tarsier/models/utils.py ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import torch.nn.functional as F
3
+ from einops import rearrange
4
+
5
+ def _unpad_input(input_ids, attention_mask):
6
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
7
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
8
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
9
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
10
+ input_ids = rearrange(input_ids, 'b s ... -> (b s) ...')[indices]
11
+ return input_ids, indices, cu_seqlens, max_seqlen_in_batch
12
+
13
+ def _pad_input(hidden_states, indices, batch, seqlen):
14
+ output = torch.zeros(batch * seqlen, *hidden_states.shape[1:], device=hidden_states.device,
15
+ dtype=hidden_states.dtype)
16
+ output[indices] = hidden_states
17
+ return rearrange(output, '(b s) ... -> b s ...', b=batch)
eval_scripts/DREAM-1K/tarsier/scripts/run_demo_cli.sh ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Launch the interactive CLI demo.
# Usage: run_demo_cli.sh <model_path> [n_frames] [max_new_tokens] [top_p] [temperature]

model_path=$1
n_frames=${2:-8}
max_new_tokens=${3:-512}
top_p=${4:-0.8}
temperature=${5:-0}

# Quote expansions so model paths containing spaces work.
python3 -m tasks.demo_cli \
    --model_name_or_path "$model_path" \
    --config "configs/tarser2_default_config.yaml" \
    --max_n_frames "$n_frames" \
    --max_new_tokens "$max_new_tokens" \
    --top_p "$top_p" \
    --temperature "$temperature"
eval_scripts/DREAM-1K/tarsier/scripts/run_demo_gradio.sh ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Launch the Gradio web demo.
# Usage: run_demo_gradio.sh <model_path> [max_n_frames]

model_path=$1
max_n_frames=${2:-8}

# Quoted so model paths containing spaces survive the export.
export MODEL_PATH="$model_path"
export MAX_N_FRAMES="$max_n_frames"

python3 -m tasks.demo_gradio
eval_scripts/DREAM-1K/tarsier/scripts/run_evaluation_only.sh ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Evaluate an existing prediction file without re-running inference.
# Fill in your Azure/OpenAI credentials for the GPT-based metrics.
export AZURE_ENDPOINT=...
export OPENAI_API_KEY=...

pred_file=$1
# benchmarks=${@:2}
# NOTE: hard-coded to the DREAM-1K benchmark; the "all" fallback below is
# therefore inert. Restore the line above to take benchmarks from the CLI.
benchmarks=dream
benchmarks=${benchmarks:-"all"}

# $benchmarks is intentionally unquoted: it may expand to several benchmark names.
python -m evaluation.evaluate \
    --pred_file "$pred_file" \
    --benchmarks $benchmarks
eval_scripts/DREAM-1K/tarsier/scripts/run_inference_benchmark.sh ADDED
@@ -0,0 +1,80 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash

# Copy and Modified on: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/video_inference/scripts/video/eval/video_detail_description_eval_shard.sh
# Shards benchmark inference across chunks of GPUs, waiting until each chunk's
# GPUs are free, then evaluates the merged predictions.

model_name_or_path=$1
output_dir=$2
benchmarks=${@:3}
benchmarks=${benchmarks:-"all"}
resume=True
CHUNKS=8

# -p: do not fail if the directory already exists (e.g. on resume).
mkdir -p "$output_dir"

echo "Using $CHUNKS GPUs"

# Assuming GPULIST is a bash array containing your GPUs
GPULIST=(0 1 2 3 4 5 6 7)

# Get the number of GPUs
NUM_GPUS=${#GPULIST[@]}

# Calculate GPUs per chunk
GPUS_PER_CHUNK=$((NUM_GPUS / CHUNKS))


for IDX in $(seq 1 $CHUNKS); do
    START=$(((IDX-1) * GPUS_PER_CHUNK))
    LENGTH=$GPUS_PER_CHUNK # Length for slicing, not the end index

    CHUNK_GPUS=(${GPULIST[@]:$START:$LENGTH})

    # Convert the chunk GPUs array to a comma-separated string
    CHUNK_GPUS_STR=$(IFS=,; echo "${CHUNK_GPUS[*]}")

    ALL_GPUS_FREE=0
    while [ $ALL_GPUS_FREE -eq 0 ]; do
        ALL_GPUS_FREE=1 # Assume all GPUs are free initially

        # BUGFIX: `$CHUNK_GPUS` expands to element 0 only for a bash array,
        # so only the first GPU of each chunk used to be checked. Iterate
        # over the whole array instead.
        for GPU_ID in "${CHUNK_GPUS[@]}"; do
            MEM_USAGE=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID" | tr -d '[:space:]')

            # Assuming a GPU is considered free if its memory usage is less than 100 MiB
            if [ "$MEM_USAGE" -ge 100 ]; then
                ALL_GPUS_FREE=0
                echo "GPU $GPU_ID is in use. Memory used: ${MEM_USAGE}MiB."
                break # Exit the loop early as we found a GPU that is not free
            fi
        done

        if [ $ALL_GPUS_FREE -eq 0 ]; then
            echo "Not all GPUs in chunk are free. Checking again in 10 seconds..."
            sleep 10
        fi
    done

    echo "CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR"
    # $benchmarks stays unquoted on purpose: it can hold several names.
    CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR python3 -m tasks.inference_benchmark \
        --model_name_or_path "$model_name_or_path" \
        --config "configs/tarser2_default_config.yaml" \
        --max_new_tokens 512 \
        --top_p 1 \
        --temperature 0 \
        --output_dir "$output_dir" \
        --output_name predictions \
        --max_n_samples_per_benchmark -1 \
        --benchmarks $benchmarks \
        --resume $resume \
        --num_chunks $CHUNKS \
        --chunk_idx $(($IDX - 1)) > "$output_dir/run_$IDX.log" 2>&1 &

done

wait

python3 -m evaluation.evaluate \
    --pred_file "$output_dir" \
    --benchmarks $benchmarks
eval_scripts/DREAM-1K/tarsier/scripts/run_inference_caption.sh ADDED
@@ -0,0 +1,79 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash

# Copy and Modified on: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/video_inference/scripts/video/eval/video_detail_description_eval_shard.sh

# Usage: run_inference_caption.sh <model_name_or_path> <input_file> <output_dir>

model_name_or_path=$1
input_file=$2
output_dir=$3
CHUNKS=1
resume=True

# BUG FIX: plain `mkdir` aborts when the directory already exists, which broke
# reruns with resume=True; `-p` makes the call idempotent.
mkdir -p "$output_dir"

echo "Using $CHUNKS GPUs"

# Assuming GPULIST is a bash array containing your GPUs
# GPULIST=(0 1 2 3 4 5 6 7)
GPULIST=(0)

# Get the number of GPUs
NUM_GPUS=${#GPULIST[@]}

# Calculate GPUs per chunk
GPUS_PER_CHUNK=$((NUM_GPUS / CHUNKS))


for IDX in $(seq 1 $CHUNKS); do
    START=$(((IDX-1) * GPUS_PER_CHUNK))
    LENGTH=$GPUS_PER_CHUNK  # Length for slicing, not the end index

    CHUNK_GPUS=(${GPULIST[@]:$START:$LENGTH})

    # Convert the chunk GPUs array to a comma-separated string
    CHUNK_GPUS_STR=$(IFS=,; echo "${CHUNK_GPUS[*]}")

    ALL_GPUS_FREE=0
    while [ $ALL_GPUS_FREE -eq 0 ]; do
        ALL_GPUS_FREE=1  # Assume all GPUs are free initially

        # BUG FIX: `$CHUNK_GPUS` expands to the FIRST array element only, so
        # only one GPU per chunk was ever polled; "${CHUNK_GPUS[@]}" iterates
        # over every GPU in the chunk.
        for GPU_ID in "${CHUNK_GPUS[@]}"; do
            MEM_USAGE=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$GPU_ID" | tr -d '[:space:]')

            # Assuming a GPU is considered free if its memory usage is less than 100 MiB
            if [ "$MEM_USAGE" -ge 100 ]; then
                ALL_GPUS_FREE=0
                echo "GPU $GPU_ID is in use. Memory used: ${MEM_USAGE}MiB."
                break  # Exit the loop early as we found a GPU that is not free
            fi
        done

        if [ $ALL_GPUS_FREE -eq 0 ]; then
            echo "Not all GPUs in chunk are free. Checking again in 10 seconds..."
            sleep 10
        fi
    done

    echo "CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR"
    CUDA_VISIBLE_DEVICES=$CHUNK_GPUS_STR python3 -m tasks.inference_caption \
        --model_name_or_path $model_name_or_path \
        --config "configs/tarser2_default_config.yaml" \
        --max_new_tokens 512 \
        --top_p 1 \
        --temperature 0 \
        --input_file $input_file \
        --output_dir $output_dir \
        --output_name predictions \
        --max_n_samples_per_benchmark -1 \
        --resume $resume \
        --num_chunks $CHUNKS \
        --chunk_idx $(($IDX - 1)) > $output_dir/run_$IDX.log 2>&1 &

done

wait

# python3 -m evaluation.evaluate \
#     --pred_file $output_dir \
#     --benchmarks $benchmarks
eval_scripts/DREAM-1K/tarsier/tasks/demo_cli.py ADDED
@@ -0,0 +1,116 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import argparse
15
+ import os
16
+ import torch
17
+ from copy import deepcopy
18
+ from transformers import StoppingCriteriaList
19
+ from tasks.utils import load_model_and_processor
20
+ from dataset.utils import *
21
+ from tools.conversation import Chat, conv_templates, StoppingCriteriaSub
22
+ from transformers import TextStreamer
23
+ from tools.color import Color
24
+
25
+
26
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
27
+
28
def main(args):
    """Interactive CLI chat over a single image/video/GIF.

    Loads the model named by ``args.model_name_or_path``, prompts for a media
    file path, then loops reading user turns from stdin and streaming the
    model's reply. Typing ``exit`` quits; ``reset`` restarts the conversation.

    Args:
        args: parsed CLI namespace (model_name_or_path, config, max_n_frames,
            max_new_tokens, top_p, temperature, debug).
    """
    # Load Model
    print(f"### Start loading model...")
    model, processor = load_model_and_processor(args.model_name_or_path, args.config)
    print(f"### Finish loading model.")
    # Pick the conversation template from the checkpoint name.
    if 'tarsier2' in args.model_name_or_path.lower():
        conv_type = 'tarsier2-7b'
    else:
        if '7b' in args.model_name_or_path.lower():
            conv_type = 'tarsier-7b'
        elif '13b' in args.model_name_or_path.lower():
            conv_type = 'tarsier-13b'
        elif '34b' in args.model_name_or_path.lower():
            conv_type = 'tarsier-34b'
        else:
            raise ValueError(f"Unknow model: {args.model_name_or_path}")

    chat = Chat(model, processor, device=device, debug = args.debug)
    conv = deepcopy(conv_templates[conv_type])

    img_path = ''
    has_img = False
    while True:
        # First ask for a media file; only paths recognized by
        # get_visual_type as video/gif/image are accepted.
        if not has_img:
            try:
                img_path = input(Color.green(f"{conv.roles[1]}: ") + "Input a file path of your image/video:")
                img_path = img_path.strip()
                if not (os.path.exists(img_path) and get_visual_type(img_path) in ['video', 'gif', 'image']):
                    continue
                has_img = True
                conv.messages.append([conv.roles[0], {"type": "video", "text": img_path}])
                print(Color.green(f"{conv.roles[1]}: ") + "Received your file. Now let's start conversation! :)")
                print(Color.red(f"<Input \'exit\' to exit and \'reset\' to restart>"))
            except Exception as e:
                print(f"Error: {e}")
                print("exit...")
                exit()
        # Keep prompting until a non-empty user turn is entered.
        inp = ""
        while inp == "":
            inp = input(Color.blue(f"{conv.roles[0]}: ")).strip()
        if inp.strip() == 'exit':
            print("exit...")
            exit()
        elif inp.strip() == "reset":
            # NOTE(review): has_img is not reset to False here, so the
            # file-path prompt is skipped after 'reset' and the fresh
            # conversation has no media attached — confirm intended.
            conv = deepcopy(conv_templates[conv_type])
            img_path = ''
            continue
        conv = chat.ask(inp, conv)

        # Stop generation at EOS; stream tokens to stdout as they decode.
        stop_words_ids = [torch.tensor([processor.processor.tokenizer.eos_token_id]).to(device)]
        stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
        streamer = TextStreamer(processor.processor.tokenizer, skip_prompt=True, skip_special_tokens=True)

        inputs, conv = chat.prepare_model_inputs(conv, args.max_n_frames)
        print("conv:", conv)
        print(Color.green(f"{conv.roles[1]}: "), end="")
        with torch.inference_mode():
            # NOTE(review): stopping_criteria is already a StoppingCriteriaList;
            # it is passed wrapped in another list — confirm transformers
            # accepts the nested form in the pinned version.
            outputs = model.generate(
                **inputs,
                do_sample=True if args.temperature > 0 else False,
                temperature=args.temperature,
                top_p=args.top_p,
                max_new_tokens=args.max_new_tokens,
                streamer=streamer,
                use_cache=True,
                stopping_criteria=[stopping_criteria])
        # Decode only the newly generated tokens (skip the prompt prefix).
        outputs = processor.processor.tokenizer.decode(outputs[0][inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
        conv.messages.append(
            [conv.roles[1], {"text": outputs, "type": "text"}]
        )

        if args.debug:
            print(f"Conversation state: {conv}")
+
102
+ if __name__ == "__main__":
103
+ # python3 -m tasks.demo_cli --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml
104
+ import argparse
105
+
106
+ parser = argparse.ArgumentParser()
107
+ parser.add_argument('--model_name_or_path', type=str)
108
+ parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
109
+ parser.add_argument("--max_n_frames", type=int, default=16, help="Max number of frames to apply average sampling from the given video.")
110
+ parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
111
+ parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
112
+ parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")
113
+ parser.add_argument("--debug", action="store_true")
114
+ args = parser.parse_args()
115
+
116
+ main(args)
eval_scripts/DREAM-1K/tarsier/tasks/demo_gradio.py ADDED
@@ -0,0 +1,230 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # copy and modify from: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/demo/demo.py
16
+
17
+ # import spaces # for deploying on huggingface ZeroGPU
18
+ from copy import deepcopy
19
+ import gradio as gr
20
+ from gradio.themes.utils import colors, fonts, sizes
21
+ from tools.conversation import Chat, conv_templates
22
+ from tasks.utils import load_model_and_processor, file_to_base64
23
+ from dataset.tarsier_datamodule import init_processor
24
+ import os
25
+ import torch
26
+
27
+ # huggingface-cli login
28
+
29
+ model_path = os.getenv("MODEL_PATH", "omni-research/Tarsier2-7b")
30
+ config_path = "configs/tarser2_default_config.yaml"
31
+ max_n_frames = int(os.getenv("MAX_N_FRAMES", 16))
32
+ debug = False
33
+ device = 'cuda' if not debug else 'cpu'
34
+
35
+ # ========================================
36
+ # Model Initialization
37
+ # ========================================
38
def init_model():
    """Load the Tarsier model + processor and wrap them in a Chat helper.

    In debug mode (module-level ``debug`` flag) only the processor is
    initialized and the model is left as None.

    Returns:
        Chat: chat wrapper bound to the global device/debug settings.
    """
    print("Start Initialization...")
    # if torch.cuda.is_available():
    if not debug:
        model, processor = load_model_and_processor(model_path, config_path)
    else:
        print(f"No Valid GPU! Lauch in debug mode!")
        processor = init_processor(model_path, config_path)
        model = None
    # BUG FIX: the original line ended with a stray trailing `c`
    # (`Chat(model, processor, device, debug)c`), which is a SyntaxError
    # and prevented the module from importing at all.
    chat = Chat(model, processor, device, debug)
    print('Initialization Finished')
    return chat
50
+
51
+
52
+ # ========================================
53
+ # Gradio Setting
54
+ # ========================================
55
def gradio_reset(chat_state, img_file):
    """Clear the conversation history and reset every input widget."""
    if chat_state is not None:
        chat_state.messages = []
    img_file = None
    # Order matches the `clear.click` outputs list:
    # chatbot, up_image, up_video, up_gif, text_input, upload_button,
    # chat_state, img_file.
    return (
        None,
        gr.update(value=None, interactive=True),
        gr.update(value=None, interactive=True),
        gr.update(value=None, interactive=True),
        gr.update(placeholder='Please upload your video first', interactive=False),
        gr.update(value="Upload & Start Chat", interactive=True),
        chat_state,
        img_file,
    )
60
+
61
+
62
def upload_img(gr_img, gr_video, gr_gif, chat_state, num_frames):
    """Start a fresh conversation seeded with the uploaded image/video/GIF.

    Args:
        gr_img / gr_video / gr_gif: at most one is non-None; the first
            non-None (video > image > gif) is attached to the conversation.
        chat_state: previous conversation state (replaced by a new template).
        num_frames: frame-count slider value (currently unused here; kept for
            interface compatibility with the click handler).

    Returns:
        7 values matching the handler's outputs:
        up_image, up_video, up_gif, text_input, upload_button, chat_state, img_file.
    """
    print("video, image or gif:", gr_video, gr_img, gr_gif)
    conv_type = ''
    if 'tarsier2-7b' in model_path.lower():
        conv_type = 'tarsier2-7b'
    # elif '7b' in model_path.lower():
    #     conv_type = 'tarsier-7b'
    # elif '13b' in model_path.lower():
    #     conv_type = 'tarsier-13b'
    # elif '34b' in model_path.lower():
    #     conv_type = 'tarsier-34b'
    else:
        raise ValueError(f"Unknow model: {model_path}")
    chat_state = deepcopy(conv_templates[conv_type])

    if gr_img is None and gr_video is None and gr_gif is None:
        # BUG FIX: this branch used to return 8 values while the Gradio click
        # handler binds only 7 outputs, which raised at runtime. Return one
        # update per output component instead.
        return (None, None, None,
                gr.update(interactive=True, placeholder='Please upload video/image first!'),
                gr.update(value="Upload & Start Chat", interactive=True),
                chat_state, None)
    if gr_video or gr_img or gr_gif:
        # Pick the first non-None upload (video takes precedence).
        for img_file in [gr_video, gr_img, gr_gif]:
            if img_file is not None:
                break
        chat_state.messages.append([chat_state.roles[0], {"type": "video", "text": img_file}])
    return gr.update(interactive=True), gr.update(interactive=True), gr.update(interactive=True), gr.update(interactive=True, placeholder='Type and press Enter'), gr.update(value="Start Chatting", interactive=False), chat_state, img_file
85
+
86
+
87
def gradio_ask(user_message, chatbot, chat_state):
    """Append the user's turn to the conversation and the chat display."""
    if not user_message:
        # Empty input: keep the textbox active and show a hint instead.
        return gr.update(interactive=True, placeholder='Input should not be empty!'), chatbot, chat_state
    chat_state = chat.ask(user_message, chat_state)
    # New chat row with the answer slot left empty for gradio_answer to fill.
    return '', chatbot + [[user_message, None]], chat_state
93
+
94
# @spaces.GPU(duration=120) # for deploying on huggingface ZeroGPU
def gradio_answer(chatbot, chat_state, img_file, top_p, temperature, n_frames=None):
    """Generate the assistant's reply for the latest user turn."""
    llm_message, chat_state = chat.answer(
        conv=chat_state,
        n_frames=n_frames,
        max_new_tokens=256,
        num_beams=1,
        temperature=temperature,
        top_p=top_p,
    )
    # Fill the answer slot gradio_ask left empty in the last chat row.
    chatbot[-1][1] = llm_message
    print(chat_state)
    print(f"Answer: {llm_message}")
    return chatbot, chat_state
101
+
102
+
103
class OpenGVLab(gr.themes.base.Base):
    """Gradio theme for the demo.

    Thin wrapper over ``gr.themes.base.Base`` that pins the demo's palette,
    spacing, and fonts, and overrides the page background. Copied from the
    Ask-Anything / VideoChat2 demo.
    """
    def __init__(
        self,
        *,
        primary_hue=colors.blue,
        secondary_hue=colors.sky,
        neutral_hue=colors.gray,
        spacing_size=sizes.spacing_md,
        radius_size=sizes.radius_sm,
        text_size=sizes.text_md,
        font=(
            fonts.GoogleFont("Noto Sans"),
            "ui-sans-serif",
            "sans-serif",
        ),
        font_mono=(
            fonts.GoogleFont("IBM Plex Mono"),
            "ui-monospace",
            "monospace",
        ),
    ):
        # Forward every style knob to the base theme unchanged.
        super().__init__(
            primary_hue=primary_hue,
            secondary_hue=secondary_hue,
            neutral_hue=neutral_hue,
            spacing_size=spacing_size,
            radius_size=radius_size,
            text_size=text_size,
            font=font,
            font_mono=font_mono,
        )
        # Single project-specific override on top of the base theme.
        super().set(
            body_background_fill="*neutral_50",
        )
137
+
138
+
139
# Instantiate the demo theme defined above.
gvlabtheme = OpenGVLab(primary_hue=colors.blue,
    secondary_hue=colors.sky,
    neutral_hue=colors.gray,
    spacing_size=sizes.spacing_md,
    radius_size=sizes.radius_sm,
    text_size=sizes.text_md,
    )

# Inline the logo as base64 so the page has no external asset dependency.
logo_b64 = file_to_base64("assets/figures/tarsier_logo.jpg")
title = f"""<center><a href="https://github.com/bytedance/tarsier"><img src="data:image/jpeg;base64,{logo_b64}" alt="Tarsier" border="0" style="margin: 0 auto; height: 140px;" /></a></center>"""
description ="""<center><p><a href='https://github.com/bytedance/tarsier'><img src='https://img.shields.io/badge/Github-Code-blue'></a></p><p></center>
"""


# Page layout: left column = media upload tabs + sampling controls,
# right column = chatbot and text input.
with gr.Blocks(title="Tarsier",theme=gvlabtheme,css="#chatbot {overflow:auto; height:500px;} #InputVideo {overflow:visible; height:320px;} footer {visibility: none}") as demo:
    gr.Markdown(title)
    gr.Markdown(description)
    with gr.Row():
        with gr.Column(scale=0.5, visible=True) as video_upload:
            with gr.Column(elem_id="image", scale=0.5) as img_part:
                with gr.Tab("Video", elem_id='video_tab'):
                    up_video = gr.Video(interactive=True, include_audio=True, elem_id="video_upload", height=360)
                with gr.Tab("Image", elem_id='image_tab'):
                    up_image = gr.Image(type="filepath", interactive=True, elem_id="image_upload", height=360)
                with gr.Tab("GIF", elem_id='gif_tab'):
                    up_gif = gr.File(type="filepath", file_count="single", file_types=[".gif"], interactive=True, elem_id="gif_upload", height=360)
            upload_button = gr.Button(value="Upload & Start Chat", interactive=True, variant="primary")
            # NOTE(review): this `clear` is shadowed by the second
            # `clear = gr.Button("🔄Clear️")` below, so this "Restart" button
            # never gets a click handler — confirm whether it should be wired.
            clear = gr.Button("Restart")

            # num_beams = gr.Slider(
            #     minimum=1,
            #     maximum=10,
            #     value=1,
            #     step=1,
            #     interactive=True,
            #     label="beam search numbers)",
            # )

            # Sampling controls passed into gradio_answer on each turn.
            temperature = gr.Slider(
                minimum=0.0,
                maximum=1.0,
                value=0.0,
                step=0.1,
                interactive=True,
                label="Temperature",
            )

            top_p = gr.Slider(
                minimum=0.1,
                maximum=1.0,
                value=1.0,
                step=0.1,
                interactive=True,
                label="Top_p",
            )

            num_frames = gr.Slider(
                minimum=4,
                maximum=16,
                value=16,
                step=2,
                interactive=True,
                label="#Frames",
            )

        with gr.Column(visible=True) as input_raws:
            # Per-session state: conversation object and uploaded media path.
            chat_state = gr.State()
            img_file = gr.State()
            chatbot = gr.Chatbot(elem_id="chatbot",label='VideoChat')
            with gr.Row():
                with gr.Column(scale=0.7):
                    text_input = gr.Textbox(show_label=False, placeholder='Please upload your video first', interactive=False, container=False)
                with gr.Column(scale=0.15, min_width=0):
                    run = gr.Button("💭Send")
                with gr.Column(scale=0.15, min_width=0):
                    clear = gr.Button("🔄Clear️")

    # Model is loaded once at page build time.
    chat = init_model()
    upload_button.click(upload_img, [up_image, up_video, up_gif, chat_state, num_frames], [up_image, up_video, up_gif, text_input, upload_button, chat_state, img_file])

    # Both Enter-in-textbox and the Send button trigger ask -> answer.
    text_input.submit(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
        gradio_answer, [chatbot, chat_state, img_file, top_p, temperature, num_frames], [chatbot, chat_state]
    )
    run.click(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
        gradio_answer, [chatbot, chat_state, img_file, top_p, temperature, num_frames], [chatbot, chat_state]
    )
    run.click(lambda: "", None, text_input)
    clear.click(gradio_reset, [chat_state, img_file], [chatbot, up_image, up_video, up_gif, text_input, upload_button, chat_state, img_file], queue=False)


demo.launch()
# demo.launch(server_name="0.0.0.0", server_port=11451)
eval_scripts/DREAM-1K/tarsier/tasks/inference_benchmark.py ADDED
@@ -0,0 +1,197 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import argparse
15
+ import torch
16
+ from tasks.utils import load_model_and_processor
17
+ # from dataset.mm_dataset import MMDataset
18
+ from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
19
+ from dataset.tarsier_datamodule import TarsierDataset
20
+ from dataset.utils import *
21
+
22
+ import json
23
+ import os
24
+ import math
25
+ from tqdm import tqdm
26
+ import yaml
27
+
28
# Root directory holding the benchmark annotation jsonl files,
# resolved relative to this file (tasks/../data/annotations).
ANN_ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) + '/../data/annotations'

# Maps benchmark short name -> annotation filename under ANN_ROOT_DIR.
# Grouped by task type: captioning (DREAM), multi-choice QA, open-ended QA,
# and classic caption benchmarks.
Benchmark2fname = {
    'dream': 'DREAM-1k.jsonl',

    'next-qa': 'Next-QA-val-multi_choice.jsonl',
    'egoschema': 'EgoSchema_subset.jsonl', # change to EgoSchema_fullset.jsonl if you test on the fullset
    'mvbench': 'MVBench.jsonl',
    'tvbench': 'TVBench.jsonl',
    'video-mme': 'Video-MME.jsonl',
    'favor-bench': 'FAVOR-Bench.jsonl',

    'msvd-qa': 'MSVD-QA-val.jsonl',
    'msr-vtt-qa': 'MSR-VTT-QA-val.jsonl',
    'tgif-qa': 'TGIF-QA-test.jsonl',
    'anet-qa': 'ActivityNet-QA-test.jsonl',

    'msvd-caption': 'MSVD-Caption-test.jsonl',
    'msr-vtt-caption': 'MSR-VTT-Caption-test.jsonl',
    'vatex-caption': 'VATEX-test.jsonl',

    'video_caption': "caption-test.jsonl", # custom for video caption test
}
51
+
52
def get_ann_file_path(benchmark):
    """Return the annotation jsonl path for a benchmark short name.

    Args:
        benchmark: a key of ``Benchmark2fname``.

    Raises:
        KeyError: if `benchmark` is not a known benchmark name.
        FileNotFoundError: if the annotation file is missing on disk.
    """
    ann_fpath = os.path.join(ANN_ROOT_DIR, Benchmark2fname[benchmark])
    # Explicit raise instead of `assert`: asserts are stripped when Python
    # runs with -O, which would silently skip this existence check.
    if not os.path.exists(ann_fpath):
        raise FileNotFoundError(f"The annotation file for {benchmark} not exists: {ann_fpath}")
    return ann_fpath
56
+
57
def split_list(lst, n):
    """Partition `lst` into at most `n` consecutive chunks of near-equal size."""
    # Ceiling division fixes the per-chunk size; the final chunk may be shorter.
    per_chunk = math.ceil(len(lst) / n)
    return [lst[start:start + per_chunk] for start in range(0, len(lst), per_chunk)]


def get_chunk(lst, n, k):
    """Return the k-th (0-based) of the `n` chunks of `lst`."""
    return split_list(lst, n)[k]
66
+
67
+
68
def parse_args():
    """
    Parse command-line arguments.

    Returns:
        argparse.Namespace with `benchmarks` already expanded from task-type
        aliases to concrete benchmark names via `get_benchmarks`.
    """
    parser = argparse.ArgumentParser()

    # Define the command-line arguments

    parser.add_argument('--model_name_or_path', type=str, required=True)
    parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
    # parser.add_argument("--max_n_frames", type=int, default=8, help="Max number of frames to apply average sampling from the given video.")
    parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
    parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
    parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")

    parser.add_argument("--output_dir", type=str, help="Directory to save the model results", required=True)
    parser.add_argument("--output_name", type=str, default="predictions", help="Name of the file for storing results")

    # Sharding controls: this process handles chunk `chunk_idx` of `num_chunks`.
    parser.add_argument("--num_chunks", type=int, default=1)
    parser.add_argument("--chunk_idx", type=int, default=0)

    parser.add_argument("--max_n_samples_per_benchmark", type=int, default=-1, help="Set as a small number (like 100) to run as debug.")
    parser.add_argument('--benchmarks', nargs='+', default=["all"], help="Default as 'all' to inference on all benchmarks; Also could be task types: ('dream', 'caption', 'mc_qa', 'oe_qa'); And specific benchmark names: ('dream', 'msvd-caption', 'msr-vtt-caption', 'vatex-caption', 'next-qa', 'egoschema', 'mvbench', 'video-mme', 'msvd-qa', 'msr-vtt-qa', 'tgif-qa', 'anet-qa')")

    # Accepts "true"/"false" strings from shell wrappers.
    parser.add_argument("--resume", type=lambda x: (str(x).lower() == 'true'), default=True, help="Resume from existing inference results file or overwrite them.")

    args = parser.parse_args()

    # Expand task-type aliases (e.g. 'all', 'caption') into benchmark names.
    args.benchmarks = get_benchmarks(args.benchmarks)
    print("### Selected Benchmarks:", args.benchmarks)

    return args
100
+
101
+
102
def run_inference(args):
    """
    Run inference on selected benchmarks.

    Loads this worker's shard of each benchmark's annotations, generates a
    prediction per sample, and appends each result as one jsonl line to
    `<output_dir>/<output_name>[_<num_chunks>_<chunk_idx>].jsonl`.

    Args:
        args: Command-line arguments.
    """
    # Initialize the model
    # model, processor = load_model_and_processor(args.model_name_or_path, args.max_n_frames) # max_n_frames set in config_file
    data_config = yaml.safe_load(open(args.config, 'r'))
    model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)

    # Gather this worker's shard of every selected benchmark's annotations.
    all_chunks = []
    count = 0
    print(f"Start loading dataset...")
    for benchmark in args.benchmarks:
        ann_fpath = get_ann_file_path(benchmark)
        cur_anns = [json.loads(line) for line in open(ann_fpath)]
        if args.max_n_samples_per_benchmark > 0:
            cur_anns = cur_anns[:args.max_n_samples_per_benchmark]
        count += len(cur_anns)
        cur_chunk = get_chunk(cur_anns, args.num_chunks, args.chunk_idx)
        all_chunks.extend(cur_chunk)
        print(f"### [{benchmark}] Load chunk with {len(cur_chunk)} samples from {len(cur_anns)} samples.")
    print(f"### Finish loading chunk with {len(all_chunks)} samples from {count} samples in total.")

    # Create the output directory if it doesn't exist
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # Per-shard output file name so parallel workers don't collide.
    if args.num_chunks > 1:
        output_name = f"{args.output_name}_{args.num_chunks}_{args.chunk_idx}"
    else:
        output_name = args.output_name
    answers_file = os.path.join(args.output_dir, f"{output_name}.jsonl")
    if args.resume and os.path.exists(answers_file):
        # Skip samples already written, keyed by "<dataset>-<idx>".
        processed_data = [json.loads(line) for line in open(answers_file)]
        processed_idxs = set([f"{d['dataset']}-{d['idx']}" for d in processed_data])
        all_chunks = [d for d in all_chunks if f"{d['dataset']}-{d['idx']}" not in processed_idxs]
        print(f"### Resume from {len(processed_idxs)} samples. {len(all_chunks)} samples to run.", flush=True)
        ans_file = open(answers_file, "a")
    else:
        ans_file = open(answers_file, "w")

    dataset = TarsierDataset(
        anns=all_chunks, config=data_config, processor=processor
    )

    # Greedy decoding by default; temperature > 0 enables sampling.
    generate_kwargs = {
        "do_sample": True if args.temperature > 0 else False,
        "max_new_tokens": args.max_new_tokens,
        "top_p": args.top_p,
        "temperature": args.temperature,
        "use_cache": True
    }

    if len(dataset) == 0:
        return
    for ann, inputs in tqdm(dataset, total=len(dataset)):
        if inputs is not None:
            if "prompt" in inputs:
                prompt = get_prompt_from_data_dict(ann)
                print(f"###Prompt:\n{prompt}", flush=True)
            # print(f"Input: {processor.processor.tokenizer.decode(inputs['input_ids'][0]), skip_special_tokens=True}", flush=True)
            try:
                # Only tensor entries are model inputs; everything else in
                # `inputs` is metadata and is skipped.
                model_inputs = {}
                for k, v in inputs.items():
                    if not isinstance(v, torch.Tensor):
                        continue
                    model_inputs[k] = v.to(model.device)
                outputs = model.generate(
                    **model_inputs,
                    **generate_kwargs,
                )
                # Decode only the newly generated tokens (skip the prompt prefix).
                output_text = processor.processor.tokenizer.decode(outputs[0][model_inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
            except Exception as e:
                # Best-effort: record a sentinel so the run continues.
                print(f"Error: {e}")
                output_text = "<error>"
            print(f"###Prediction:\n{output_text}", flush=True)
            answer = ann['messages'][-1]['content'][-1]['reference']
            print(f"###Answer:\n{answer}", flush=True)
            put_pred_to_data_dict(output_text, ann)
        else:
            put_pred_to_data_dict("<error>", ann)
        try:
            ans_file.write(json.dumps(ann, ensure_ascii=False) + "\n")
        except:
            # NOTE(review): bare except; presumably guards against
            # non-serializable/encoding issues — consider narrowing to
            # (TypeError, UnicodeEncodeError).
            ans_file.write(json.dumps(ann) + "\n")
        ans_file.flush()

    ans_file.close()
193
+
194
+
195
if __name__ == "__main__":
    # Parse CLI arguments and run this worker's inference shard.
    run_inference(parse_args())
eval_scripts/DREAM-1K/tarsier/tasks/inference_caption.py ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import argparse
15
+ import torch
16
+ from tasks.utils import load_model_and_processor
17
+ # from dataset.mm_dataset import MMDataset
18
+ from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
19
+ from dataset.tarsier_datamodule import TarsierDataset
20
+ from dataset.utils import *
21
+
22
+ import json
23
+ import os
24
+ import math
25
+ from tqdm import tqdm
26
+ import yaml
27
+
28
+
29
# Root directory holding annotation jsonl files, resolved relative to this
# file (tasks/../data/annotations).
ANN_ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) + '/../data/annotations'
30
+
31
+
32
def split_list(lst, n):
    """Split `lst` into at most `n` consecutive, near-equal chunks."""
    # Ceiling division picks the chunk size; only the last chunk can be short.
    size = math.ceil(len(lst) / n)
    return [lst[i:i + size] for i in range(0, len(lst), size)]


def get_chunk(lst, n, k):
    """Return chunk `k` (0-based) of `lst` split into `n` chunks."""
    return split_list(lst, n)[k]
41
+
42
+
43
def parse_args():
    """
    Parse command-line arguments.

    Returns:
        argparse.Namespace for the captioning inference run (input jsonl file,
        output location, sampling settings, and sharding controls).
    """
    parser = argparse.ArgumentParser()

    # Define the command-line arguments

    parser.add_argument('--model_name_or_path', type=str, required=True)
    parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
    # parser.add_argument("--max_n_frames", type=int, default=8, help="Max number of frames to apply average sampling from the given video.")
    parser.add_argument("--max_new_tokens", type=int, default=512, help="max number of generated tokens")
    parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
    parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")

    parser.add_argument("--input_file", type=str, help="Directory to input_file (jsonline)", required=True)
    parser.add_argument("--output_dir", type=str, help="Directory to save the model results", required=True)
    parser.add_argument("--output_name", type=str, default="predictions", help="Name of the file for storing results")

    # Sharding controls: this process handles chunk `chunk_idx` of `num_chunks`.
    parser.add_argument("--num_chunks", type=int, default=1)
    parser.add_argument("--chunk_idx", type=int, default=0)

    parser.add_argument("--max_n_samples_per_benchmark", type=int, default=-1, help="Set as a small number (like 100) to run as debug.")
    # Accepts "true"/"false" strings from shell wrappers.
    parser.add_argument("--resume", type=lambda x: (str(x).lower() == 'true'), default=True, help="Resume from existing inference results file or overwrite them.")

    args = parser.parse_args()

    return args
71
+
72
+
73
def run_inference(args):
    """
    Run inference on selected benchmarks.

    Reads caption annotations from `args.input_file` (jsonl), generates one
    prediction per sample for this worker's shard, and appends results to
    `<output_dir>/<output_name>[_<num_chunks>_<chunk_idx>].jsonl`.

    Args:
        args: Command-line arguments.
    """
    # Initialize the model
    # model, processor = load_model_and_processor(args.model_name_or_path, args.max_n_frames) # max_n_frames set in config_file
    data_config = yaml.safe_load(open(args.config, 'r'))
    model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)

    # Load this worker's shard of the input annotations.
    all_chunks = []
    count = 0
    print(f"Start loading dataset...")
    ann_fpath = args.input_file
    cur_anns = [json.loads(line) for line in open(ann_fpath)]
    if args.max_n_samples_per_benchmark > 0:
        cur_anns = cur_anns[:args.max_n_samples_per_benchmark]
    count += len(cur_anns)
    cur_chunk = get_chunk(cur_anns, args.num_chunks, args.chunk_idx)
    all_chunks.extend(cur_chunk)
    print(f"### Load chunk with {len(cur_chunk)} samples from {len(cur_anns)} samples.")
    print(f"### Finish loading chunk with {len(all_chunks)} samples from {count} samples in total.")

    # Create the output directory if it doesn't exist
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    # Per-shard output file name so parallel workers don't collide.
    if args.num_chunks > 1:
        output_name = f"{args.output_name}_{args.num_chunks}_{args.chunk_idx}"
    else:
        output_name = args.output_name
    answers_file = os.path.join(args.output_dir, f"{output_name}.jsonl")
    if args.resume and os.path.exists(answers_file):
        # Skip samples already written, keyed by "<dataset>-<idx>".
        processed_data = [json.loads(line) for line in open(answers_file)]
        processed_idxs = set([f"{d['dataset']}-{d['idx']}" for d in processed_data])
        all_chunks = [d for d in all_chunks if f"{d['dataset']}-{d['idx']}" not in processed_idxs]
        print(f"### Resume from {len(processed_idxs)} samples. {len(all_chunks)} samples to run.", flush=True)
        ans_file = open(answers_file, "a")
    else:
        ans_file = open(answers_file, "w")

    dataset = TarsierDataset(
        anns=all_chunks, config=data_config, processor=processor
    )

    # Greedy decoding by default; temperature > 0 enables sampling.
    generate_kwargs = {
        "do_sample": True if args.temperature > 0 else False,
        "max_new_tokens": args.max_new_tokens,
        "top_p": args.top_p,
        "temperature": args.temperature,
        "use_cache": True
    }

    if len(dataset) == 0:
        return
    for ann, inputs in tqdm(dataset):
        if inputs is not None:
            prompt = get_prompt_from_data_dict(ann)
            print(f"###Prompt:\n{prompt}", flush=True)
            # print(f"Input: {processor.processor.tokenizer.decode(inputs['input_ids'][0]), skip_special_tokens=True}", flush=True)
            try:
                # Only tensor entries are model inputs; everything else in
                # `inputs` is metadata and is skipped.
                model_inputs = {}
                for k, v in inputs.items():
                    if not isinstance(v, torch.Tensor):
                        continue
                    model_inputs[k] = v.to(model.device)
                outputs = model.generate(
                    **model_inputs,
                    **generate_kwargs,
                )
                # Decode only the newly generated tokens (skip the prompt prefix).
                output_text = processor.processor.tokenizer.decode(outputs[0][model_inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
            except Exception as e:
                # Best-effort: record a sentinel so the run continues.
                print(f"Error: {e}")
                output_text = "<error>"
            print(f"###Prediction:\n{output_text}", flush=True)
            put_pred_to_data_dict(output_text, ann)
        else:
            put_pred_to_data_dict("<error>", ann)
        try:
            ans_file.write(json.dumps(ann, ensure_ascii=False) + "\n")
        except:
            # NOTE(review): bare except; presumably guards against
            # non-serializable/encoding issues — consider narrowing to
            # (TypeError, UnicodeEncodeError).
            ans_file.write(json.dumps(ann) + "\n")
        ans_file.flush()

    ans_file.close()
160
+
161
+
162
if __name__ == "__main__":
    # Example:
    # python3 -m tasks.inference_caption --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml --input_file data/annotations/caption-test-new.jsonl --output_dir tmp_outputs
    run_inference(parse_args())
eval_scripts/DREAM-1K/tarsier/tasks/inference_quick_start.py ADDED
@@ -0,0 +1,91 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from tasks.utils import load_model_and_processor
15
+ from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
16
+ from dataset.utils import *
17
+
18
+ import os
19
+ import torch
20
+ from tqdm import tqdm
21
+ import yaml
22
+
23
def process_one(model, processor, prompt, video_file, generate_kwargs):
    """Run one generation round for a single video/image file.

    Formats the sample, moves tensor inputs onto the model device, generates,
    and returns only the newly generated text (prompt tokens stripped).
    """
    sample = format_one_sample(video_file, prompt)
    batch_data = processor(sample)
    print(f"###Prompt:\n{get_prompt_from_data_dict(sample)}")
    # Keep only tensor fields and place them on the inference device.
    model_inputs = {
        key: value.to(model.device)
        for key, value in batch_data.items()
        if isinstance(value, torch.Tensor)
    }
    outputs = model.generate(
        **model_inputs,
        **generate_kwargs,
    )
    # Decode only tokens produced after the prompt.
    prompt_len = model_inputs['input_ids'][0].shape[0]
    return processor.processor.tokenizer.decode(
        outputs[0][prompt_len:], skip_special_tokens=True
    )
40
+
41
def run():
    """CLI entry point: caption/QA over one media file or a directory of them."""
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str)
    parser.add_argument('--config', type=str, default="configs/tarser2_default_config.yaml")
    parser.add_argument('--instruction', type=str, default="Describe the video in detail.", help='Input prompt.')
    parser.add_argument('--input_path', type=str, default="assets/examples", help='Path to video/image; or Dir to videos/images')
    # max_n_frames is configured via the data config file rather than a CLI flag.
    parser.add_argument("--max_new_tokens", type=int, default=256, help="max number of generated tokens")
    parser.add_argument("--top_p", type=float, default=1, help="Top_p sampling")
    parser.add_argument("--temperature", type=float, default=0, help="Set temperature > 0 to enable sampling generation.")

    args = parser.parse_args()

    # fix: close the config file handle instead of leaking it.
    with open(args.config, 'r') as f:
        data_config = yaml.safe_load(f)
    model, processor = load_model_and_processor(args.model_name_or_path, data_config=data_config)

    generate_kwargs = {
        "do_sample": True if args.temperature > 0 else False,
        "max_new_tokens": args.max_new_tokens,
        "top_p": args.top_p,
        "temperature": args.temperature,
        "use_cache": True
    }
    assert os.path.exists(args.input_path), f"input_path not exist: {args.input_path}"
    # fix: `input_files` was unbound (NameError) when input_path existed but was
    # neither a directory nor a supported media file; now the assert below fires
    # with its intended message instead.
    input_files = []
    if os.path.isdir(args.input_path):
        input_files = [os.path.join(args.input_path, fn) for fn in os.listdir(args.input_path) if get_visual_type(fn) in ['video', 'gif', 'image']]
    elif get_visual_type(args.input_path) in ['video', 'gif', 'image']:
        input_files = [args.input_path]
    assert len(input_files) > 0, f"None valid input file in: {args.input_path} {VALID_DATA_FORMAT_STRING}"

    for input_file in tqdm(input_files, desc="Generating..."):
        visual_type = get_visual_type(input_file)
        # NOTE(review): --instruction defaults to a non-empty string, so the
        # per-type fallbacks below only apply when it is explicitly passed as "".
        if args.instruction:
            prompt = args.instruction
        else:
            if visual_type == 'image':
                prompt = "Describe the image in detail."
            else:
                prompt = "Describe the video in detail."

        pred = process_one(model, processor, prompt, input_file, generate_kwargs)
        print(f"###Prediction:\n{pred}")
        print('-'*100)
88
+
89
+ if __name__ == "__main__":
90
+ # python3 -m tasks.inference_quick_start --model_name_or_path /tmp/tarsier2-1226-dpo --config configs/tarser2_default_config.yaml --input_path /mnt/bn/videonasi18n/wangjw/workspace/tarsier/diving.mp4 --instruction "List the names of all sponsors on the background wall."
91
+ run()
eval_scripts/DREAM-1K/tarsier/tasks/utils.py ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from models.modeling_tarsier import TarsierForConditionalGeneration, LlavaConfig
15
+ # from dataset.processor import Processor
16
+ from dataset.tarsier_datamodule import init_processor
17
+ import torch
18
+ import base64
19
+ from tools.color import Color
20
+ import yaml
21
+
22
def load_model_and_processor(model_name_or_path, data_config):
    """Load a Tarsier model (bf16, auto device map) and its data processor.

    Args:
        model_name_or_path: HF hub id or local checkpoint directory.
        data_config: a config dict, or a path to a yaml file to parse.

    Returns:
        (model, processor) with the model already in eval mode.
    """
    print(Color.red(f"Load model and processor from: {model_name_or_path}"), flush=True)
    if isinstance(data_config, str):
        # fix: close the yaml file handle instead of leaking it.
        with open(data_config, 'r') as f:
            data_config = yaml.safe_load(f)
    processor = init_processor(model_name_or_path, data_config)
    model_config = LlavaConfig.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,
    )
    model = TarsierForConditionalGeneration.from_pretrained(
        model_name_or_path,
        config=model_config,
        device_map='auto',
        torch_dtype=torch.bfloat16,
        trust_remote_code=True
    )
    model.eval()
    return model, processor
40
+
41
def file_to_base64(img_path):
    """Read a file in binary mode and return its base64-encoded string."""
    with open(img_path, 'rb') as fin:
        return base64.b64encode(fin.read()).decode()
45
+
eval_scripts/DREAM-1K/tarsier/tools/color.py ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
class Color:
    """Wrap text in ANSI escape sequences for colored terminal output."""

    @staticmethod
    def red(x):
        return f'\033[31m{x}\033[0m'

    @staticmethod
    def green(x):
        return f'\033[32m{x}\033[0m'

    @staticmethod
    def yellow(x):
        return f'\033[33m{x}\033[0m'

    @staticmethod
    def blue(x):
        return f'\033[34m{x}\033[0m'

    @staticmethod
    def violet(x):
        return f'\033[35m{x}\033[0m'
35
+
36
+
eval_scripts/DREAM-1K/tarsier/tools/conversation.py ADDED
@@ -0,0 +1,256 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ # copy and modify from: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/conversation.py
16
+ from PIL import Image
17
+ import torch
18
+ from transformers import StoppingCriteria, StoppingCriteriaList
19
+ from dataset.custom_data_parsers.utils import put_pred_to_data_dict, get_prompt_from_data_dict
20
+ from dataset.tarsier_datamodule import TarsierDataProcessor
21
+ from dataset.utils import *
22
+
23
+ from enum import auto, Enum
24
+ import os
25
+ import re
26
+
27
+ data_dict_tmp = {
28
+ "messages": [
29
+ {
30
+ "role": "user",
31
+ "content": [
32
+ {
33
+ "type": "video",
34
+ "video": {
35
+ "video_file": "/mnt/hdfs/vlm/videos/movies_aligned_0523/tt8266310/tt8266310_1.50.24-1.50.29.mp4"}
36
+ },
37
+ {
38
+ "type": "text",
39
+ "text": "Describe the video in detail."
40
+ }
41
+ ]
42
+ },
43
+ {
44
+ "role": "assistant",
45
+ "content": [
46
+ {
47
+ "type": "text",
48
+ "text": "A man in the driver's seat, wearing a black jacket with a maroon shirt, fastens his seatbelt while smiling at the man in the passenger seat, who is adjusting his position. The passenger, also wearing a black jacket with a maroon shirt, turns to look forward and smiles. The driver then leans forward to start the car and leans back in his seat. In the background, a beige car is visible through the window."
49
+ }]}
50
+ ],
51
+ "dataset": "video_caption",
52
+ "task": "video/caption",
53
+ "idx": 0,
54
+ }
55
+
56
+
57
+ IMAGE_TOKEN = "<image>"
58
+ VIDEO_TOKEN = "<video>"
59
+
60
class SeparatorStyle(Enum):
    """Different separator styles for conversation templates."""
    SINGLE = auto()
    TWO = auto()
65
def get_data_dict(conv, max_n_frames=None):
    """Convert a conversation object into the standard messages data dict.

    Consecutive contents from the same role are merged into a single message.
    When given, *max_n_frames* caps the number of frames sampled per video.
    The dict is validated with check_data_format() before being returned.
    """
    data_dict = {
        "messages": []
    }
    # fix: default the task label so an empty conversation no longer raises
    # NameError at the `data_dict['dataset'] = task` assignment below.
    # NOTE(review): `task` reflects the type of the *last* non-empty content,
    # so a video-then-question chat ends up labelled "text-only" — confirm
    # this is the intended downstream behavior.
    task = "text-only"
    for role, message in conv.messages:
        if not message:
            continue
        text = message["text"]
        content_type = message["type"]
        content = {}
        if content_type == "video":
            content['type'] = 'video'
            content['video'] = {
                "video_file": text
            }
            if max_n_frames is not None:
                content['video']['n_frames'] = max_n_frames
            task = "video/QA"
        elif content_type == "image":
            content['type'] = 'image'
            content['image'] = {
                "image_file": text
            }
            task = "image/QA"
        else:
            # "text" and any unrecognized type are treated as plain text.
            content['type'] = 'text'
            content['text'] = text
            task = "text-only"
        if data_dict['messages'] and data_dict['messages'][-1]['role'] == role:
            data_dict['messages'][-1]['content'].append(content)
        else:
            data_dict['messages'].append({
                "role": role,
                "content": [content]
            })
    data_dict['dataset'] = task
    data_dict['task'] = task
    check_data_format(data_dict)
    return data_dict
107
+
108
+
109
class StoppingCriteriaSub(StoppingCriteria):
    """Stop generation once the tail of the sequence matches any stop-id sequence."""

    def __init__(self, stops=None, encounters=1):
        super().__init__()
        # fix: avoid a shared mutable default argument; `encounters` is kept
        # only for call-site compatibility (it was never used).
        self.stops = stops if stops is not None else []

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> bool:
        # Compare each stop sequence against the last len(stop) generated ids.
        for stop in self.stops:
            if torch.all((stop == input_ids[0][-len(stop):])).item():
                return True
        return False
119
+
120
+
121
class Chat:
    """Conversational wrapper: turns a conversation object into model inputs
    and appends generated replies back onto the conversation."""

    def __init__(self, model, processor: TarsierDataProcessor, device='cuda', debug=False):
        # When `model` is None, answer() returns a fixed fake response (debug mode).
        self.model = model
        self.processor = processor
        self.device = device
        self.debug = debug
        # Stop generation at the tokenizer's EOS token.
        stop_words_ids = [torch.tensor([self.processor.processor.tokenizer.eos_token_id]).to(device)]
        self.stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])

    def ask(self,text,conv):
        """Append a user turn containing `text` to the conversation and return it."""
        conv.messages.append([conv.roles[0], {"text": text, "type": "text"}])
        return conv

    def prepare_model_inputs(self, conv, n_frames=None):
        """Build tensor inputs (on self.device) from the conversation."""
        # print(conv.messages)
        data_dict = get_data_dict(conv, n_frames)
        if self.debug:
            # print(f"visual_data_file: {visual_data_file}", flush=True)
            print(f"###Prompt:\n{get_prompt_from_data_dict(data_dict)}")

        batch_data = self.processor(data_dict)
        # Keep only tensor fields and move them to the inference device.
        model_inputs = {}
        for k, v in batch_data.items():
            if not isinstance(v, torch.Tensor):
                continue
            model_inputs[k] = v.to(self.device)
        return model_inputs, conv

    def answer(self, conv, n_frames=None, max_new_tokens=256, num_beams=1, min_length=1, top_p=1.0,
               repetition_penalty=1.0, length_penalty=1, temperature=0):
        """Generate the assistant reply, append it to the conversation,
        and return (reply_text, conv). temperature > 0 enables sampling."""
        inputs, conv = self.prepare_model_inputs(conv, n_frames)
        if self.model is not None:
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                stopping_criteria=self.stopping_criteria,
                num_beams=num_beams,
                do_sample=True if temperature > 0 else False,
                min_length=min_length,
                top_p=top_p,
                repetition_penalty=repetition_penalty,
                length_penalty=length_penalty,
                temperature=temperature,
            )
            # Decode only the newly generated tokens (skip the echoed prompt).
            output_text = self.processor.processor.tokenizer.decode(outputs[0][inputs['input_ids'][0].shape[0]:], skip_special_tokens=True)
        else:
            output_text = "Fake respone as launched in debug mode!"
        conv.messages.append(
            [conv.roles[1], {"text": output_text, "type": "text"}]
        )
        return output_text, conv
172
+
173
class EasyDict(dict):
    """
    Get attributes

    >>> d = EasyDict({'foo':3})
    >>> d['foo']
    3
    >>> d.foo
    3
    >>> d.bar
    Traceback (most recent call last):
    ...
    AttributeError: 'EasyDict' object has no attribute 'bar'

    Works recursively

    >>> d = EasyDict({'foo':3, 'bar':{'x':1, 'y':2}})
    >>> isinstance(d.bar, dict)
    True
    >>> d.bar.x
    1
    """

    def __init__(self, d=None, **kwargs):
        if d is None:
            d = {}
        if kwargs:
            d.update(**kwargs)
        # Route every entry through __setattr__ so nested dicts are wrapped too.
        for k, v in d.items():
            setattr(self, k, v)
        # Class attributes
        for k in self.__class__.__dict__.keys():
            if not (k.startswith("__") and k.endswith("__")) and not k in ("update", "pop"):
                setattr(self, k, getattr(self, k))

    def __setattr__(self, name, value):
        # Recursively wrap dicts (including those inside lists/tuples) so that
        # attribute access works at any depth; note tuples become lists here.
        if isinstance(value, (list, tuple)):
            value = [self.__class__(x) if isinstance(x, dict) else x for x in value]
        elif isinstance(value, dict) and not isinstance(value, self.__class__):
            value = self.__class__(value)
        # Keep the attribute view and the dict view in sync.
        super(EasyDict, self).__setattr__(name, value)
        super(EasyDict, self).__setitem__(name, value)

    # Item assignment shares the exact same wrapping/sync logic.
    __setitem__ = __setattr__

    def update(self, e=None, **f):
        # Reimplemented so updates also go through __setattr__ (keeps sync).
        d = e or dict()
        d.update(f)
        for k in d:
            setattr(self, k, d[k])

    def pop(self, k, d=None):
        # Drop the attribute mirror before removing the dict entry.
        if hasattr(self, k):
            delattr(self, k)
        return super(EasyDict, self).pop(k, d)
228
+
229
+ conv_tarsier = EasyDict({
230
+ "system": "",
231
+ "roles": ("USER", "ASSISTANT"),
232
+ "messages": [],
233
+ "sep1": " ",
234
+ "sep2": "</s>",
235
+ }
236
+ )
237
+
238
+ conv_tarsier_yi = EasyDict({
239
+ "system": "",
240
+ "roles": ("USER", "ASSISTANT"),
241
+ "messages": [],
242
+ "sep1": " ",
243
+ "sep2": "<|endoftext|>",
244
+ }
245
+ )
246
+
247
+ conv_tarsier_qwen2_vl = EasyDict({
248
+ "system": "",
249
+ "roles": ("user", "assistant"),
250
+ "messages": [],
251
+ }
252
+ )
253
+
254
+ conv_templates = {
255
+ "tarsier2-7b": conv_tarsier_qwen2_vl
256
+ }
eval_scripts/DREAM-1K/tarsier/tools/ptbtokenizer.py ADDED
@@ -0,0 +1,66 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python
2
+ #
3
+ # File Name : ptbtokenizer.py
4
+ #
5
+ # Description : Do the PTB Tokenization and remove punctuations.
6
+ #
7
+ # Creation Date : 29-12-2014
8
+ # Last Modified : Thu Mar 19 09:53:35 2015
9
+ # Authors : Hao Fang <hfang@uw.edu> and Tsung-Yi Lin <tl483@cornell.edu>
10
+ import os
11
+ import subprocess
12
+ import tempfile
13
+
14
+ # path to the stanford corenlp jar
15
+ STANFORD_CORENLP_3_4_1_JAR = os.path.dirname(os.path.abspath(__file__)) + '/stanford-corenlp-3.4.1.jar'
16
+
17
+ # punctuations to be removed from the sentences
18
+ PUNCTUATIONS = ["''", "'", "``", "`", "-LRB-", "-RRB-", "-LCB-", "-RCB-", \
19
+ ".", "?", "!", ",", ":", "-", "--", "...", ";"]
20
+
21
class PTBTokenizer:
    """Python wrapper of Stanford PTBTokenizer"""

    def tokenize(self, captions_for_image):
        """Tokenize captions with the Stanford PTB tokenizer and strip punctuation.

        *captions_for_image* maps an id to a list of dicts carrying a 'caption'
        field; returns a dict mapping the same ids to lists of lower-cased,
        space-joined token strings (order preserved).
        """
        # NOTE(review): $JAVA_HOME is used directly as the executable, so the
        # environment must point JAVA_HOME at the `java` binary itself rather
        # than the usual JDK directory — confirm before running.
        cmd = [os.getenv("JAVA_HOME"), '-cp', STANFORD_CORENLP_3_4_1_JAR, \
                'edu.stanford.nlp.process.PTBTokenizer', \
                '-preserveLines', '-lowerCase']

        # ======================================================
        # prepare data for PTB Tokenizer
        # ======================================================
        # One id entry per caption so ids can be zipped back with output lines.
        final_tokenized_captions_for_image = {}
        image_id = [k for k, v in captions_for_image.items() for _ in range(len(v))]
        sentences = '\n'.join([c['caption'].replace('\n', ' ') for k, v in captions_for_image.items() for c in v])

        # ======================================================
        # save sentences to temporary file
        # ======================================================
        path_to_jar_dirname=os.path.dirname(os.path.abspath(__file__))
        tmp_file = tempfile.NamedTemporaryFile(delete=False, dir=path_to_jar_dirname)
        tmp_file.write(sentences.encode())
        tmp_file.close()

        # ======================================================
        # tokenize sentence
        # ======================================================
        # The tokenizer reads from the temp file passed as the last argument.
        cmd.append(os.path.basename(tmp_file.name))
        # NOTE(review): `communicate(input=...)` is called without stdin=PIPE,
        # which the subprocess module does not support — presumably inherited
        # from pycocoevalcap; verify on the target Python version.
        p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname, \
                stdout=subprocess.PIPE)
        token_lines = p_tokenizer.communicate(input=sentences.rstrip())[0]
        token_lines = token_lines.decode()
        lines = token_lines.split('\n')
        # remove temp file
        os.remove(tmp_file.name)

        # ======================================================
        # create dictionary for tokenized captions
        # ======================================================
        # Re-associate each tokenized line with its id and drop punctuation tokens.
        for k, line in zip(image_id, lines):
            if not k in final_tokenized_captions_for_image:
                final_tokenized_captions_for_image[k] = []
            tokenized_caption = ' '.join([w for w in line.rstrip().split(' ') \
                    if w not in PUNCTUATIONS])
            final_tokenized_captions_for_image[k].append(tokenized_caption)

        return final_tokenized_captions_for_image
eval_scripts/DREAM-1K/tarsier/tools/rw_utils.py ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (2024) Bytedance Ltd. and/or its affiliates
2
+
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import json
15
+ from json import JSONEncoder
16
+ import numpy
17
+ import pandas as pd
18
+
19
class NumpyArrayEncoder(JSONEncoder):
    """JSON encoder that serializes numpy arrays as (nested) lists."""

    def default(self, obj):
        if isinstance(obj, numpy.ndarray):
            return obj.tolist()
        # Defer to the base encoder (raises TypeError for unknown types).
        return super().default(obj)
24
+
25
def write_txt(data, path):
    """Write each item of *data* to *path* as one str()-formatted line."""
    with open(path, 'w', encoding='utf-8') as fout:
        fout.writelines(f'{item}\n' for item in data)
29
+
30
def read_txt(path):
    """Read *path* (utf-8, undecodable bytes ignored) into a list of lines
    with trailing newlines stripped."""
    with open(path, 'r', encoding='utf-8', errors='ignore') as fin:
        return [line.strip('\n') for line in fin]
34
+
35
def read_jsonlines(path):
    """Parse a jsonlines file into a list of objects (one per line)."""
    with open(path) as fin:
        return [json.loads(raw) for raw in fin]
42
+
43
def write_jsonlines(data, path, cls=None, ensure_ascii=False):
    """Dump each record of *data* to *path* as one JSON object per line.

    *cls* is forwarded to json.dumps (custom encoder); non-ASCII text is
    written as-is unless ensure_ascii=True.
    """
    with open(path, 'w') as fout:
        for record in data:
            fout.write(json.dumps(record, ensure_ascii=ensure_ascii, cls=cls))
            fout.write('\n')
49
+
50
def read_parquet(path):
    """Load a parquet file and return its rows as a list of dicts."""
    frame = pd.read_parquet(path)
    return frame.to_dict('records')
53
+
54
def write_parquet(data, path):
    """Persist tabular *data* (e.g. a list of row dicts) to *path* as parquet."""
    pd.DataFrame(data).to_parquet(path)
57
+
58
def read_csv(path):
    """Read a comma-separated file into a list of row dicts.

    NOTE: uses the default comma separator, while write_csv in this module
    writes tab-separated output.
    """
    return pd.read_csv(path).to_dict(orient='records')
61
+
62
def write_csv(data, path, sep='\t'):
    """Write tabular *data* (e.g. a list of row dicts) to *path*, no index column.

    Args:
        data: rows accepted by pandas.DataFrame (list of dicts, dict of lists...).
        path: output file path.
        sep: field delimiter; defaults to tab for backward compatibility.
            NOTE(review): read_csv in this module uses the default comma
            separator, so a write_csv/read_csv round-trip only works when
            sep=',' is passed here.
    """
    frame = pd.DataFrame(data)
    frame.to_csv(path, index=False, sep=sep)
eval_scripts/Daily-Omni/Daily-Omni_pipeline.sh ADDED
@@ -0,0 +1,62 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Daily-Omni evaluation pipeline.
# Usage: bash Daily-Omni_pipeline.sh <results_dir>
# Generates captions per model, merges them into the grouped QA json,
# runs the Gemini-based evaluation, then computes per-model accuracy.
MODEL_PATHS=(
    "path_to_AVoCaDO"
)
RESULTS_DIR="$1"

ORIGINAL_FILE="eval_scripts/Daily-Omni/grouped_data.json"
MERGED_FILE="$RESULTS_DIR/captioned_results.json"
# Seed the merged file from the pristine QA data on first run.
if [ ! -f "$MERGED_FILE" ]; then
    echo "MERGED_FILE not found. Creating from ORIGINAL_FILE..."
    cp "$ORIGINAL_FILE" "$MERGED_FILE"
fi
CAPTION_FILES_TO_MERGE=()
CAPTION_KEYS=()

# Step 1: caption generation (one jsonl per model)
for model_path in "${MODEL_PATHS[@]}"; do
    # Strip a trailing slash so basename yields the model name.
    CLEAN_PATH="${model_path%/}"
    model_name=$(basename "$CLEAN_PATH")

    caption_file="$RESULTS_DIR/${model_name}_caption.jsonl"
    echo "Output caption file will be: $caption_file"

    python eval_scripts/Daily-Omni/generate_caption.py \
        --model_path "$model_path" \
        --fout_path "$caption_file"

    if [ -f "$caption_file" ]; then
        CAPTION_FILES_TO_MERGE+=("$caption_file")
        CAPTION_KEYS+=("${model_name}_caption")
    else
        echo "Error: Caption file $caption_file not generated for model $model_path."
        exit 1
    fi
done

# Step 2: merge generated caption files
echo "Merging all generated caption files..."
python eval_scripts/Daily-Omni/merge_captions.py \
    --original_file "$MERGED_FILE" \
    --caption_files "${CAPTION_FILES_TO_MERGE[@]}" \
    --merged_file "$MERGED_FILE"


# Step 3: evaluation (caption-grounded QA via Gemini)
python eval_scripts/Daily-Omni/evaluation.py \
    --merged_file "$MERGED_FILE" \
    --caption_keys "${CAPTION_KEYS[@]}"

# Step 4: analysis and save evaluation results
for caption_key in "${CAPTION_KEYS[@]}"; do
    echo "Running analysis for caption key: $caption_key"

    result_file="$RESULTS_DIR/${caption_key}_result.jsonl"
    # evaluation.py stores model answers under <model>_resp.
    answer_key="${caption_key//_caption/_resp}"

    if [ -f "$result_file" ]; then
        python eval_scripts/Daily-Omni/analysis.py --result_file_path "$result_file" --answer_key "$answer_key"
    else
        echo "Warning: Result file '$result_file' not found for analysis."
    fi
done
eval_scripts/Daily-Omni/analysis.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ import argparse
3
+ import os
4
+
5
+ if __name__ == "__main__":
6
+ parser = argparse.ArgumentParser(description="Analyze the evaluation results.")
7
+ parser.add_argument("--result_file_path", type=str, required=True, help="Path to the result file (.jsonl).")
8
+ parser.add_argument("--answer_key", type=str, required=True, help="The key for the model's response in the result file.")
9
+
10
+ args = parser.parse_args()
11
+
12
+ data = pd.read_json(args.result_file_path, lines=True)
13
+
14
+ acc = (data['answer'].str.upper() == data[args.answer_key].str.upper()).mean()
15
+ print(f"Accuracy for {args.answer_key} is: {acc:.2%}")
16
+
17
+ with open(f"{os.path.dirname(args.result_file_path)}/{args.answer_key}.log", "w", encoding='utf-8') as fout:
18
+ fout.write(f"Accuracy for {args.answer_key} is: {acc:.2%}")
eval_scripts/Daily-Omni/evaluation.py ADDED
@@ -0,0 +1,225 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
###
# using a llm to answer questions regarding to the video with the specific caption
###
# fix: the original module used `json` and `SEED` before they were defined
# (imports and SEED sat below their first use), which raised NameError at
# import time. Imports are hoisted and SEED/set_seed moved ahead of CONFIG;
# the duplicate `from google import genai` is removed.
import os
import sys
import time
import json
import traceback
import multiprocessing
import random
import numpy as np
import argparse
import subprocess
from google import genai
from google.genai import types
from IPython.display import HTML, Image, Markdown, display
from google.genai.types import (
    FunctionDeclaration,
    GenerateContentConfig,
    GoogleSearch,
    HarmBlockThreshold,
    HarmCategory,
    Part,
    SafetySetting,
    ThinkingConfig,
    Tool,
    ToolCodeExecution,
)


def set_seed(seed):
    """Seed numpy and the stdlib RNG for reproducible evaluation runs."""
    np.random.seed(seed)
    random.seed(seed)


SEED = 42
set_seed(SEED)

# Vertex AI credentials / endpoint configuration (fill in before running).
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''
LOCATION = "global"
user_info_path = ''  # path to a json file holding {"project_id": ...}
user_info = json.load(open(user_info_path))
PROJECT_ID = user_info['project_id']
MODEL = "gemini-2.5-pro"

# Disable all safety blocking so grading answers are never withheld.
safety_settings = [
    SafetySetting(category=HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=HarmBlockThreshold.OFF),
    SafetySetting(category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=HarmBlockThreshold.OFF),
    SafetySetting(category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=HarmBlockThreshold.OFF),
    SafetySetting(category=HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=HarmBlockThreshold.OFF)
]

# Deterministic generation with a small thinking budget for QA grading.
CONFIG = types.GenerateContentConfig(
    temperature=0,
    top_p=0.001,
    thinking_config=types.ThinkingConfig(
        include_thoughts=True,
        thinking_budget=512
    ),
    safety_settings=safety_settings,
    seed=SEED,
    system_instruction='''
    You are a precise QA assistant. Your task is to answer multiple-choice questions based ONLY on the video caption provided.
    Do not use any outside knowledge or assumptions—your answer must strictly reflect information from the caption.
    Always output only the capital letter corresponding to your choice (e.g., A, B, C, D).
    If the caption does not provide enough information to answer the question, output "N/A" instead.
    '''
)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
69
+
70
def caption2json(json_path, caption_path):
    """Merge per-video captions from a jsonl file into the QA json.

    The model name is taken from the caption filename (text before the first
    underscore); matched entries gain a `<model>_caption` field and the result
    is written to `<model>_merge_data.json` in the working directory.
    """
    with open(json_path, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    model = os.path.basename(caption_path).split("_")[0]

    # Each jsonl line maps video_id -> caption; blank lines are skipped.
    captions = {}
    with open(caption_path, 'r', encoding='utf-8') as f:
        for raw_line in f:
            if not raw_line.strip():
                continue
            captions.update(json.loads(raw_line))

    for entry in json_data:
        vid = entry.get("video_id")
        if vid in captions:
            entry[f"{model}_caption"] = captions[vid]

    with open(f"{model}_merge_data.json", 'w', encoding='utf-8') as f:
        json.dump(json_data, f, ensure_ascii=False, indent=2)

    print(f"merged successfully, the output file is {model}_merge_data.json")
94
+
95
+
96
def generate(prompt):
    """Query the Gemini client with retries; return (answer, thinking) text.

    Retries up to 10 times, both on API exceptions and on empty answers,
    sleeping 3s between attempts. Returns (None, None) when no non-empty
    answer was obtained.
    """
    contents = [prompt]

    answer, thinking = None, None
    max_retries = 10

    for i in range(max_retries):
        try:
            response = client.models.generate_content(
                model=MODEL,
                contents=contents,
                config=CONFIG
            )

            # Separate the visible answer parts from "thought" parts.
            answer_parts, thought_parts = [], []
            for part in response.candidates[0].content.parts:
                if not getattr(part, "text", None):
                    continue
                if getattr(part, "thought", False):
                    thought_parts.append(part.text)
                else:
                    answer_parts.append(part.text)
            answer = "\n".join(answer_parts).strip()
            thinking = "\n".join(thought_parts).strip()
            if answer:
                break
            else:
                print(f"[WARN] Attempt {i+1}: empty answer, retrying ... ")
                time.sleep(3)
        except Exception as e:
            print(f"[ERROR] Attempt {i+1} failed: {e}")
            traceback.print_exc()
            time.sleep(3)
    if not answer:
        return None, None
    print(answer)
    return answer, thinking
133
+
134
def worker(task):
    """Answer one caption-grounded multiple-choice question.

    *task* is an 8-tuple; returns a jsonl-ready record where the model
    response is stored under *answer_key* (None when generation fails).
    """
    (vid, video_duration, question, choices, answer,
     caption_key, answer_key, caption) = task
    option_lines = "\n".join(f"{c}" for c in choices)
    prompt_filled = f'''
    Here is the video caption:
    "{caption}"

    Question: {question}
    Choices:
    {option_lines}'''
    record = {
        "video_id": vid,
        "video_duration": video_duration,
        "question": question,
        "choices": choices,
        "answer": answer,
        caption_key: caption,
        answer_key: None,
    }
    try:
        resp, _ = generate(prompt_filled)
        record[answer_key] = resp
    except Exception:
        traceback.print_exc()
    return record
166
+
167
def run_multiprocess_tasks(tasks, num_processes=None, fout_path=None):
    """Fan *tasks* out to a process pool running `worker`.

    Defaults to one process per CPU. When *fout_path* is given, results are
    also written there as one JSON object per line. Returns the result list.
    """
    pool_size = multiprocessing.cpu_count() if num_processes is None else num_processes

    with multiprocessing.Pool(processes=pool_size) as pool:
        results = pool.map(worker, tasks)

    if fout_path:
        with open(fout_path, "w", encoding='utf-8') as sink:
            for record in results:
                sink.write(json.dumps(record, ensure_ascii=False) + '\n')
                sink.flush()
    return results
180
+
181
def eval_dailyomni_caption_qas(file_path, caption_keys=None):
    """Evaluate caption-grounded QA for each caption key in the merged json.

    For every key in *caption_keys* (default: ["omni_caption"]) this builds
    one task per (video, question) pair, answers them via
    run_multiprocess_tasks, and writes `<key>_result.jsonl` next to
    *file_path*. Returns the concatenated result records.
    """
    if caption_keys is None:
        # fix: avoid a mutable default argument.
        caption_keys = ["omni_caption"]
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    all_results = []
    for caption_key in caption_keys:
        # Model responses are stored under the matching "<model>_resp" key.
        answer_key = caption_key.replace("_caption", "_resp")
        fout_path = f"{os.path.dirname(file_path)}/{caption_key}_result.jsonl"

        tasks = []
        for video_info in data:
            vid = video_info["video_id"]
            video_duration = video_info["video_duration"]
            caption = video_info[caption_key]
            for q in video_info["questions"]:
                tasks.append((
                    vid,
                    video_duration,
                    q["Question"],
                    q["Choice"],
                    q["Answer"],
                    caption_key,
                    answer_key,
                    caption
                ))

        results = run_multiprocess_tasks(tasks, num_processes=20, fout_path=fout_path)
        all_results.extend(results)

    return all_results
212
+
213
+ if __name__ == "__main__":
214
+ parser = argparse.ArgumentParser(description="Evaluate captions using Gemini.")
215
+ parser.add_argument("--merged_file", type=str, required=True, help="Path to the merged caption file.")
216
+ parser.add_argument(
217
+ "--caption_keys",
218
+ type=str,
219
+ nargs='+',
220
+ required=True,
221
+ help="A list of caption keys to evaluate"
222
+ )
223
+ args = parser.parse_args()
224
+
225
+ eval_dailyomni_caption_qas(args.merged_file, caption_keys=args.caption_keys)
eval_scripts/Daily-Omni/generate_caption.py ADDED
@@ -0,0 +1,142 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
4
+ from qwen_omni_utils import process_mm_info
5
+ import argparse
6
+ import json
7
+ from tqdm import tqdm
8
+ from pathlib import Path
9
+ import multiprocessing as mp
10
+ import traceback
11
+ import random
12
+ import glob
13
+
14
+ VIDEO_MAX_PIXELS = 401408 # 512*28*28
15
+ VIDEO_TOTAL_PIXELS = 20070400 # 512*28*28*50
16
+ USE_AUDIO_IN_VIDEO = True
17
+ video_base_dir = "path_to_Daily-Omni_Videos"
18
+ os.environ['VIDEO_MAX_PIXELS'] = str(VIDEO_TOTAL_PIXELS)
19
+
20
def chat(file_path, prompt, model, processor, model_path, max_new_tokens=2048):
    """Caption a single video with Qwen2.5-Omni and return the reply text.

    Builds a system+user conversation around the video at `file_path`,
    runs greedy decoding, and returns only the assistant's portion of the
    decoded output.

    Args:
        file_path: Path to the input video file.
        prompt: Text instruction appended after the video.
        model: Loaded Qwen2_5OmniForConditionalGeneration instance.
        processor: Matching Qwen2_5OmniProcessor.
        model_path: Unused here; kept for call-site compatibility.
        max_new_tokens: Cap on newly generated (thinker) tokens.

    Returns:
        The generated caption string.
    """
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": file_path,
                    "max_pixels": VIDEO_MAX_PIXELS,
                    "max_frames": 256,
                },
                {"type": "text", "text": prompt},
            ],
        },
    ]

    chat_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(messages, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    model_inputs = processor(
        text=chat_text,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
    )
    model_inputs = model_inputs.to(model.device).to(model.dtype)

    generated_ids = model.generate(
        **model_inputs,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
        do_sample=False,
        thinker_max_new_tokens=max_new_tokens,
    )
    decoded = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # Everything after the final "\nassistant\n" marker is the model's reply.
    return decoded.split("\nassistant\n")[-1]
56
+
57
def worker_proc(rank, gpu_id, model_path, video_paths, prompt, out_path):
    """Per-process worker: load the model on one GPU and caption a chunk of videos.

    Each result is written to `out_path` as one JSON line
    ({"video_id": ..., "caption": ...}); per-video failures are logged and
    skipped so one bad video does not kill the whole chunk.

    Args:
        rank: Worker index (used for logging only).
        gpu_id: CUDA device index to pin the model to.
        model_path: HF checkpoint path for Qwen2.5-Omni.
        video_paths: List of video paths this worker is responsible for.
        prompt: Captioning instruction forwarded to `chat`.
        out_path: JSONL output file for this worker's results.
    """
    device_map = {"": f"cuda:{gpu_id}"}

    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map=device_map,
        attn_implementation="flash_attention_2",
    )
    # Text-only evaluation: drop the speech-synthesis head to save memory.
    model.disable_talker()
    processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

    # Context manager (was a bare open()) guarantees the file is closed even
    # if an exception escapes the loop.
    with open(out_path, "w", encoding="utf-8") as fout:
        for video_path in tqdm(video_paths, desc=f"Worker-{rank}[GPU-{gpu_id}]"):
            try:
                model_generation = chat(video_path, prompt, model, processor, model_path)

                # "clip.mp4" -> "clip": keep everything before the extension.
                video_id = os.path.basename(video_path).split(".mp4")[0]

                out_data = {
                    "video_id": video_id,
                    "caption": model_generation,
                }

                fout.write(json.dumps(out_data, ensure_ascii=False) + "\n")
                # Flush per item so partial progress survives a crash.
                fout.flush()
            except Exception as e:
                print(f"[Worker-{rank}] Error on {video_path}: {e}")
                traceback.print_exc()

    print(f"[Worker-{rank}] Done, wrote results to {out_path}")
90
+
91
def run_multi_gpu(model_path, video_paths, prompt_list, final_out_path, num_gpus=8):
    """Fan captioning out across GPUs, one worker process per chunk, then merge.

    Splits `video_paths` into at most `num_gpus` chunks, launches one
    `worker_proc` per chunk (each with a randomly sampled prompt), waits for
    all workers, then concatenates their part files into `final_out_path`
    and deletes the parts.

    Args:
        model_path: HF checkpoint path.
        video_paths: All videos to caption.
        prompt_list: Candidate prompts; one is sampled per worker.
        final_out_path: Merged JSONL output path (must end in ".jsonl").
        num_gpus: Number of GPUs / worker processes to use.
    """
    # Over-estimating split: yields at most num_gpus chunks.
    chunk_size = len(video_paths) // num_gpus + 1
    chunks = [video_paths[start:start + chunk_size]
              for start in range(0, len(video_paths), chunk_size)]

    workers = []
    part_paths = []

    for rank, chunk in enumerate(chunks):
        gpu_id = rank % num_gpus
        part_path = final_out_path.replace(".jsonl", f".part{rank}.jsonl")
        part_paths.append(part_path)
        prompt = random.choice(prompt_list)

        proc = mp.Process(
            target=worker_proc,
            args=(rank, gpu_id, model_path, chunk, prompt, part_path),
        )
        proc.start()
        workers.append(proc)

    for proc in workers:
        proc.join()

    # Merge the per-worker part files and clean them up.
    with open(final_out_path, "w", encoding="utf-8") as fout:
        for part_path in part_paths:
            with open(part_path, "r", encoding="utf-8") as fin:
                for line in fin:
                    fout.write(line)
            os.remove(part_path)

    print(f"All results merged into {final_out_path}")
+
123
if __name__ == "__main__":
    cli = argparse.ArgumentParser(description="Evaluate a model and save results.")
    cli.add_argument("--model_path", type=str, required=True, help="Path to the model checkpoint.")
    cli.add_argument("--fout_path", type=str, required=True, help="Path to the output caption file")
    cli_args = cli.parse_args()
    # CUDA + multiprocessing requires the "spawn" start method.
    mp.set_start_method("spawn", force=True)

    # Recursively collect every .mp4 under the configured base directory.
    video_paths = glob.glob(os.path.join(video_base_dir, "**", "*.mp4"), recursive=True)

    # Paraphrased captioning prompts; one is sampled per worker for diversity.
    prompt_list = [
        "Provide a comprehensive description of all the content in the video, leaving out no details. Be sure to include as much of the audio information as possible, and ensure that your descriptions of the audio and video are closely aligned.",
        "Thoroughly describe everything in the video, capturing every detail. Include as much information from the audio as possible, and ensure that the descriptions of both audio and video are well-coordinated.",
        "Please describe all the information in the video without sparing every detail in it. As you describe, you should also describe as much of the information in the audio as possible, and pay attention to the synchronization between the audio and video descriptions.",
        "Offer a detailed description of the video, making sure to include every detail. Also, incorporate as much information from the audio as you can, and ensure that your descriptions of the audio and video are in sync.",
        "Describe every aspect of the video in full detail, covering all the information it contains. Additionally, include as much of the audio content as you can, and make sure your descriptions of the audio and video are synchronized.",
        "Please provide a thorough description of all the content in the video, including every detail. As you describe, ensure that you also cover as much information from the audio as possible, and be mindful of the synchronization between the audio and video as you do so.",
        "Give a detailed account of everything in the video, capturing all the specifics. While doing so, also include as much information from the audio as possible, ensuring that the descriptions of audio and video are well-synchronized."
    ]

    run_multi_gpu(cli_args.model_path, video_paths, prompt_list, cli_args.fout_path, num_gpus=8)
eval_scripts/Daily-Omni/grouped_data.json ADDED
The diff for this file is too large to render. See raw diff