Video NeVA
==========

Model Introduction
------------------

Video NeVA adds support for the video modality in NeVA by representing a video as a sequence of image frames.
Only a minor change to the :class:`~nemo.collections.multimodal.models.multimodal_llm.neva.neva_model.MegatronNevaModel` class is needed to support pretraining on video input data.
Representing the video input as a series of images is handled by the :class:`~nemo.collections.multimodal.data.neva.TarOrFolderVideoLoader` class, using Decord, which provides convenient video slicing methods.
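
For illustration, here is a minimal sketch of evenly sampling frames from a video with Decord; the actual loader additionally handles tar archives and NeVA-specific preprocessing, so treat this only as a conceptual outline (the file name is a placeholder).

.. code-block:: python

    # Minimal sketch: evenly sample ``num_frames`` frames from a video with
    # Decord. The file name is a placeholder.
    import numpy as np
    from decord import VideoReader, cpu

    vr = VideoReader("video_test.mp4", ctx=cpu(0))
    num_frames = 8
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)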

Video NeVA Configuration
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: yaml

    data:
      media_type: video
      splice_single_frame: null
      num_frames: 8
      image_token_len: 256
      image_folder: null
      video_folder: null

- ``media_type``: If set to ``video``, NeVA's dataloader performs additional preprocessing steps to represent the input video data as a series of image frames.
- ``splice_single_frame``: Can be set to ``first``, ``middle``, or ``last``. When set, only the single frame at that position in the video is selected (see the sketch after this list).
- ``image_token_len``: The NeVA dataloader calculates ``image_token_len`` based on the height and width of the preprocessed image frame and the patch size of the CLIP model being used. For example, with 224x224 frames and a patch size of 14:

  .. code-block:: python

      image_token_len = (224 // 14) * (224 // 14)  # 16 * 16 = 256

- ``num_frames``: Sets the number of image frames used to represent the video.
- ``video_folder``: Specifies the directory where the video files are located. This follows the same format as NeVA's ``image_folder``.
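
As a rough illustration of how ``num_frames`` and ``splice_single_frame`` interact, the hypothetical helper below mirrors the selection behavior described above; the actual dataloader logic lives in the NeVA dataset code and may differ in detail.

.. code-block:: python

    import numpy as np

    # Hypothetical helper, not part of NeMo: pick frame indices according to
    # the ``splice_single_frame`` and ``num_frames`` options described above.
    def select_frame_indices(total_frames, num_frames=8, splice_single_frame=None):
        if splice_single_frame == "first":
            return [0]
        if splice_single_frame == "middle":
            return [total_frames // 2]
        if splice_single_frame == "last":
            return [total_frames - 1]
        # Default: sample ``num_frames`` evenly spaced frames.
        return np.linspace(0, total_frames - 1, num_frames).astype(int).tolist()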

Inference with Video NeVA
=========================

We can run ``neva_evaluation.py``, located in ``NeMo/examples/multimodal/multimodal_llm/neva``, to generate inference results from the Video NeVA model.
Video NeVA currently supports both image and video inference: set the config attribute ``inference.media_type`` in ``NeMo/examples/multimodal/multimodal_llm/neva/conf/neva_inference.yaml`` to either ``image`` or ``video``, and set ``inference.media_base_path`` to the corresponding media path.

Inference with Pretrained Projectors and a Base LM Model
---------------------------------------------------------

An example of an inference script execution for running video inference::

    CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
        --config-path=/path/to/conf/ \
        --config-name=neva_inference.yaml \
        tensor_model_parallel_size=4 \
        pipeline_model_parallel_size=1 \
        neva_model_file=/path/to/projector/checkpoint \
        base_model_file=/path/to/base/lm/checkpoint \
        trainer.devices=4 \
        trainer.precision=bf16 \
        prompt_file=/path/to/prompt/file \
        inference.media_base_path=/path/to/videos \
        inference.media_type=video \
        output_file=/path/to/output/file \
        inference.temperature=0.2 \
        inference.top_k=0 \
        inference.top_p=0.9 \
        inference.greedy=False \
        inference.add_BOS=False \
        inference.all_probs=False \
        inference.repetition_penalty=1.2 \
        inference.insert_media_token=right \
        inference.tokens_to_generate=256 \
        quantization.algorithm=awq \
        quantization.enable=False

Example format of the ``.jsonl`` ``prompt_file``::

    {"video": "video_test.mp4", "text": "Can you describe the scene?", "category": "conv", "question_id": 0}

The input video file for this example is ``video_test.mp4``.
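
If you need to assemble such a prompt file programmatically, a minimal sketch (using only the fields shown in the example above) could look like this:

.. code-block:: python

    # Sketch: write a .jsonl prompt file with the fields shown above.
    import json

    prompts = [
        {"video": "video_test.mp4", "text": "Can you describe the scene?",
         "category": "conv", "question_id": 0},
    ]
    with open("prompts.jsonl", "w") as f:
        for prompt in prompts:
            f.write(json.dumps(prompt) + "\n")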

Output::

    <extra_id_0>System
    A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
    <extra_id_1>User
    Can you describe the scene?<video>
    <extra_id_1>Assistant
    <extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4

    CLEAN RESPONSE: Hand with a robot arm

Inference with a Finetuned Video NeVA Model (No Need to Specify a Base LM)
----------------------------------------------------------------------------

An example of an inference script execution for running video inference::

    CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_VISIBLE_DEVICES=0,1,2,3 python3 /path/to/neva_evaluation.py \
        --config-path=/path/to/conf/ \
        --config-name=neva_inference.yaml \
        tensor_model_parallel_size=4 \
        pipeline_model_parallel_size=1 \
        neva_model_file=/path/to/video/neva/model \
        trainer.devices=4 \
        trainer.precision=bf16 \
        prompt_file=/path/to/prompt/file \
        inference.media_base_path=/path/to/videos \
        inference.media_type=video \
        output_file=/path/to/output/file \
        inference.temperature=0.2 \
        inference.top_k=0 \
        inference.top_p=0.9 \
        inference.greedy=False \
        inference.add_BOS=False \
        inference.all_probs=False \
        inference.repetition_penalty=1.2 \
        inference.insert_media_token=right \
        inference.tokens_to_generate=256 \
        quantization.algorithm=awq \
        quantization.enable=False

Evaluation with Mixtral as a judge
==================================

We can run ``mixtral_eval.py``, located in ``NeMo/examples/multimodal/multimodal_llm/neva``, to call the Mixtral API and score the generated responses of two models.
Here we use ``llava-bench-in-the-wild`` as an example.
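
Under the hood, the script sends the question, the context, and the two models' responses to Mixtral and asks it to score them. As a rough sketch of that kind of call, assuming the OpenAI-compatible endpoint advertised on build.nvidia.com (the exact prompt and response parsing in ``mixtral_eval.py`` differ):

.. code-block:: python

    # Sketch of a chat-completion request against the NVIDIA-hosted Mixtral
    # endpoint; the payload is illustrative, not the script's actual prompt.
    import os
    import requests

    resp = requests.post(
        "https://integrate.api.nvidia.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
        json={
            "model": "mistralai/mixtral-8x7b-instruct-v0.1",
            "messages": [{"role": "user", "content": "Rate the two answers ..."}],
            "temperature": 0.2,
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])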

Set up
------

Before running the script, we need an NGC API key for calling the foundation models hosted on NVIDIA NGC. Once you have set up your account on NGC, log in, go to the `Mixtral 8x7B Instruct page <https://build.nvidia.com/mistralai/mixtral-8x7b-instruct/>`_, and click ``Get API Key``. Save the key.

Download dataset
----------------

We first download the ``llava-bench-in-the-wild`` dataset:

.. code-block:: bash

    git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild

Then download `rule.json <https://huggingface.co/spaces/LanguageBind/Video-LLaVA/blob/main/llava/eval/table/rule.json>`_.

Note that the answer file in ``llava-bench-in-the-wild`` consists of rows of JSON strings::

    {"question_id": 0, "prompt": "What is the name of this famous sight in the photo?", "answer_id": "TeyehNxHw5j8naXfEWaxWd", "model_id": "gpt-4-0314", "metadata": {}, "text": "The famous sight in the photo is Diamond Head."}

You may also provide your own response file in the form::

    {"response_id": 0, "response": "The famous sight in the photo is Diamond Head."}

Both formats are accepted.
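
Because both schemas are accepted, a response file can be reduced to plain response strings regardless of which format it uses. A minimal sketch, assuming only the fields shown above:

.. code-block:: python

    import json

    # Sketch: read either answer-file format into a list of response strings.
    # "text" appears in llava-bench answer files, "response" in the
    # alternative format shown above.
    def load_responses(path):
        with open(path) as f:
            return [
                row.get("text") or row.get("response")
                for row in map(json.loads, f)
            ]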

Evaluation
----------

Install the required package:

.. code-block:: bash

    pip install shortuuid

Now you can run the script:

.. code-block:: bash

    API_TOKEN=nvapi-<the API key you just saved> python3 NeMo/examples/multimodal/multimodal_llm/neva/eval/mixtral_eval.py \
        --model-name-list gpt bard --media-type image \
        --question-file llava-bench-in-the-wild/questions.jsonl \
        --responses-list llava-bench-in-the-wild/answers_gpt4.jsonl llava-bench-in-the-wild/bard_0718.jsonl \
        --answers-dir ./ \
        --context-file llava-bench-in-the-wild/context.jsonl \
        --output ./output.json

Here ``--question-file`` is the question file, ``--responses-list`` takes the two answer/response files to compare, ``--answers-dir`` is the directory where the answers are saved, ``--context-file`` is the context file, and ``--output`` is the file that receives the generated Mixtral reviews for the two models.

You'll see a result like::

    all                  84.8  72.4
    llava_bench_complex  77.0  69.0
    llava_bench_conv     91.8  77.1
    llava_bench_detail   91.3  73.2

Note that when you start a new comparison, you should remove the existing ``output.json`` file.