{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LITA Checkpoint Conversion, Finetuning and Inference Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note:\n", "Currently, this notebook can be run in a NeMo container (>= 24.07). An example command to launch the container:\n", "\n", "```\n", "docker run --gpus all -it --rm -v $PWD:/ws --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 \n", "```\n", "For inference and finetuning, you need to increase the share memory size to avoid some OOM issue. For example,\n", "```\n", "docker run --gpus all -it --rm -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:dev\n", "```\n", "\n", "By `-v $PWD:/ws`, we can mount the current local directory to `/ws/` in docker container. We may use this local directory to put the `NeMo` source code, checkpoints and dataset we will generate.\n", "\n", "If you wanna use NeMo container (>24.04 and < 24.07) (not recommended), you need to manually mount the latest nemo:\n", "```\n", "docker run --gpus all -it --rm -v :/opt/NeMo -v $PWD:/ws --shm-size=128g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# LITA Introduction\n", "\n", "[LITA](https://arxiv.org/pdf/2403.19046) stands for Language Instructed Temporal-Localization Assistant, which demonstrates strong performance on Reasoning Temporal Localization (RTL) task. It introduces time tokens to better help LLM understand 'When?' question in video. 
The figure below from the [LITA paper](https://arxiv.org/pdf/2403.19046) illustrates how LITA works.\n", "\n", "(Figure: overview of how LITA works, from the LITA paper)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenizer and Checkpoint Conversion\n", "As noted above, LITA introduces `time tokens`, so timestamps of events in a video are represented as time tokens instead of the original floating-point timestamps. Therefore, we need to add these time tokens to the tokenizer of the backbone LLM. In this example, we take `Llama-3-VILA1.5-8B` to show how to integrate LITA into a LLaVA-like model. You may follow similar steps to convert other Llama or LLaVA-like models whose backbone LLM is Llama, such as [vicuna](https://huggingface.co/lmsys/vicuna-13b-v1.5) and [llava-v1.6-vicuna-13b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b).\n", "\n", "Please download the Hugging Face `Llama-3-VILA1.5-8B` model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "%%bash\n", "mkdir -p /ws/pretrained_models && cd /ws/pretrained_models\n", "git clone https://huggingface.co/Efficient-Large-Model/Llama-3-VILA1.5-8B" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenizer conversion\n", "Here we show how to add 100 time tokens and some NeMo extra tokens to a Hugging Face tokenizer.\n", "For the definition of the NeMo extra tokens, please refer to `/opt/NeMo/nemo/collections/multimodal/data/neva/conversation.py`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define the TIME_TOKEN_TEMPLATE used to render each time token, e.g. <t0> ... <t99>\n", "TIME_TOKEN_TEMPLATE = \"<t{t}>\"\n", "hf_llm_model_path='/ws/pretrained_models/Llama-3-VILA1.5-8B/llm'\n", "tokenizer_path = '/ws/converted_models/tokenizer/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import transformers\n", "tokenizer = 
transformers.AutoTokenizer.from_pretrained(hf_llm_model_path)\n", "# token values follow the <extra_id_*> convention; see conversation.py for the exact definitions\n", "DEFAULT_IM_START_TOKEN = \"<extra_id_4>\" # mark the start of the slow tokens\n", "DEFAULT_IM_END_TOKEN = \"<extra_id_5>\" # the end of the slow tokens\n", "VID_START_TOKEN = \"<extra_id_8>\" # the start of the fast tokens\n", "VID_END_TOKEN = \"<extra_id_9>\" # the end of the fast tokens\n", "num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, VID_START_TOKEN, VID_END_TOKEN], special_tokens=True)\n", "tokenizer.pad_token = tokenizer.eos_token # use eos token as pad token\n", "num_time_tokens = 100\n", "time_tokens = [TIME_TOKEN_TEMPLATE.format(t=x) for x in range(num_time_tokens)]\n", "num_new_tokens = tokenizer.add_tokens(time_tokens)\n", "# add the other nemo extra tokens\n", "extra_tokens = [\"<extra_id_0>\", \"<extra_id_1>\", \"<extra_id_2>\", \"<extra_id_3>\", \"<extra_id_6>\", \"<extra_id_7>\"]\n", "tokenizer.add_tokens(extra_tokens)\n", "tokenizer.save_pretrained(tokenizer_path)\n", "print(len(tokenizer.vocab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check the tokenizer by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer\n", "tokenizer = get_nmt_tokenizer(library=\"huggingface\", model_name=tokenizer_path)\n", "print(len(tokenizer.vocab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if you want to convert checkpoints trained with [LITA 1.0](https://github.com/NVlabs/LITA), you should put all the extra tokens, including `DEFAULT_IM_START_TOKEN` and `DEFAULT_IM_END_TOKEN`, after the time tokens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checkpoint Conversion\n", "Since VILA and LITA share a model structure similar to LLaVA's, we'll leverage `/opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py` to convert the checkpoint. 
Since VILA and LITA depend on LLaVA, we need to clone the LLaVA repository first.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "%%bash\n", "git clone --depth 1 --branch v1.2.2 https://github.com/haotian-liu/LLaVA/ /ws/LLaVA\n", "cd /ws" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "shellscript" } }, "outputs": [], "source": [ "%%bash\n", "export PYTHONPATH=/ws/LLaVA:$PYTHONPATH\n", "# check the config file in /opt/NeMo/examples/multimodal/multimodal_llm/neva/conf/vita_config.yaml\n", "python /opt/NeMo/examples/multimodal/multimodal_llm/neva/convert_llava_to_neva.py \\\n", " --in-file /ws/pretrained_models/Llama-3-VILA1.5-8B/llm \\\n", " --mm-vision-tower /ws/pretrained_models/Llama-3-VILA1.5-8B/vision_tower \\\n", " --mm-projector-ckpt-dir /ws/pretrained_models/Llama-3-VILA1.5-8B/mm_projector \\\n", " --out-file /ws/converted_models/Llama-3-VILA1.5-8B.nemo \\\n", " --tokenizer-model /ws/converted_models/tokenizer/ \\\n", " --config-file vita_config.yaml \\\n", " --model-type VITA \\\n", " --conv-template llama_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that if the vision tower can be downloaded from Hugging Face and you don't want to change it, you don't need to pass `--mm-vision-tower` explicitly. Similarly, you only need to pass `--mm-projector-ckpt-dir` when you want to change the `mm_projector`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finetuning\n", "\n", "In this section, we'll preprocess the Dense Video Captioning dataset and then finetune from the NeMo checkpoint we just converted."
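, "\n",
"To make this concrete, the sketch below shows how a floating-point timestamp can be discretized into one of the 100 time tokens we added to the tokenizer (assuming `TIME_TOKEN_TEMPLATE = \"<t{t}>\"`; LITA's actual preprocessing may round slightly differently):\n",
"```python\n",
"TIME_TOKEN_TEMPLATE = \"<t{t}>\"  # assumed template, matching the tokenizer setup above\n",
"\n",
"def time_to_token(t: float, duration: float, num_time_tokens: int = 100) -> str:\n",
"    # Map a timestamp t (seconds) in [0, duration] to a discrete time token.\n",
"    idx = int(t / duration * num_time_tokens)\n",
"    idx = min(idx, num_time_tokens - 1)  # clamp t == duration to the last token\n",
"    return TIME_TOKEN_TEMPLATE.format(t=idx)\n",
"\n",
"print(time_to_token(30.0, 60.0))  # -> <t50>\n",
"```\n"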
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert Dataset\n", "The targeted dataset file format for finetuning should be like:\n", "```bash\n", "[\n", " # 1st example: video question answer\n", " {\n", " \"id\": \"1043215450\",\n", " \"video\": \"076101_076150/1043215450.mp4\", # video_path will be prepended\n", " \"conversations\": \n", " [\n", " {\"from\": \"human\", \"value\": \"