Multimodal Language Model Datasets
==================================

The NeMo Framework multimodal language model supports the conversation data format, which draws inspiration from and is designed based on `LLaVA <https://github.com/haotian-liu/LLaVA/tree/main>`_. Sample datasets can be explored in `LLaVA's data documentation <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_.
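As a quick orientation, the sketch below writes a minimal, hypothetical record in this conversation format. The field names follow LLaVA's data documentation; the file name and values are placeholders only.

.. code-block:: bash

    # A minimal, hypothetical record in the LLaVA-style conversation format.
    # Field names follow LLaVA's data documentation; the values are placeholders.
    cat > sample_conversation.json << 'EOF'
    [
      {
        "id": "0001",
        "image": "0001.jpg",
        "conversations": [
          {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
          {"from": "gpt", "value": "A dog playing in a park."}
        ]
      }
    ]
    EOF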
Prepare the Training Dataset
----------------------------

NeVA model training consists of two phases: pretraining and fine-tuning. Each phase requires its own dataset.
For **pretraining**, use the *LAION/CC/SBU BLIP-Caption Concept-balanced 558K* dataset, available via `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. After downloading the dataset, extract it to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json
Download the image data from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip>`__ and extract it to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images
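If you prefer to script these steps, the following sketch downloads and extracts the pretraining data with ``wget`` and ``unzip``. The Hugging Face ``resolve/main`` URLs and the target directory are assumptions, so adjust them to your environment.

.. code-block:: bash

    # Sketch: download and extract the pretraining data (paths and resolve URLs are assumptions).
    mkdir -p /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images
    cd /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K
    wget https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/blip_laion_cc_sbu_558k.json
    wget https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip
    unzip images.zip -d images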
For **fine-tuning**, use the *LLaVA-Instruct-150K* dataset, also available via `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. You can download the prompts from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main>`__ to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/
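A scripted download might look like the sketch below. The file name and ``resolve/main`` URL are assumptions; check the Hugging Face repository for the exact prompt files you need.

.. code-block:: bash

    # Sketch: fetch the instruction-tuning prompts (file name and resolve URL are assumptions).
    mkdir -p /path/to/neva/datasets/LLaVA-Instruct-150K
    cd /path/to/neva/datasets/LLaVA-Instruct-150K
    wget https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/resolve/main/llava_instruct_150k.json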
Image data for this phase can be obtained from the `COCO Dataset <https://cocodataset.org/#download>`_. Once downloaded, extract the images to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/images
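The sketch below downloads the COCO train2017 images and extracts them to the expected directory. The download URL is taken from the COCO download page and the target path is an assumption, so adjust both as needed.

.. code-block:: bash

    # Sketch: download COCO train2017 images and extract them to the images directory.
    mkdir -p /path/to/neva/datasets/LLaVA-Instruct-150K/images
    cd /path/to/neva/datasets/LLaVA-Instruct-150K
    wget http://images.cocodataset.org/zips/train2017.zip
    unzip train2017.zip -d images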
Additional Preparation for the NeVA Model
-----------------------------------------

The following instructions are specific to the NeVA model within the NeMo Framework multimodal language models.
Set Up LLaMA-2 Chat Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Support is available for both the 7B and 13B chat models. Both can be downloaded from `LLaVA's Model Zoo <https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md>`__. After downloading the checkpoint you want from Hugging Face, extract and store it on your local system to prepare for pretraining.
To convert the LLaMA-2 checkpoints to NeMo's format, follow these steps:

1. Adjust the default YAML file at `megatron_llama_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/language_modeling/conf/megatron_llama_config.yaml>`__. Ensure ``model.mcore_gpt`` and ``model.transformer_engine`` are set to ``False`` before the checkpoint conversion.
2. For the 7B chat model, use this conversion command:

   .. code-block:: bash

      python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
        --in-file <PATH-TO-HF-CHECKPOINT> \
        --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

   For the 13B model, adjust the paths in the ``--in-file`` and ``--out-file`` parameters accordingly.
3. Execute the following command to partition the checkpoint for tensor model parallelism. Use TP=4 for the 7B model and TP=8 for the 13B model so that both pretraining and fine-tuning run without memory issues. The 7B command is shown here; a hedged 13B variant is sketched after this list.

   .. code-block:: bash

      # Instructions for the 7B model partitioning provided here.
      # Adjust parameters for the 13B model as needed.
      python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
        --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo \
        --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
        --tensor_model_parallel_size=1 \
        --target_tensor_model_parallel_size=4 \
        --pipeline_model_parallel_size=1 \
        --target_pipeline_model_parallel_size=1 \
        --tp_conversion_only \
        --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
        --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model
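For reference, a 13B variant of the same partitioning command is sketched below. The checkpoint file names are assumptions that mirror the 7B example, and TP=8 follows the recommendation above; adjust the paths to match your own files.

.. code-block:: bash

    # Sketch for the 13B model: same script, target TP of 8 (file names are assumptions).
    python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
      --model_file=/path/to/neva/checkpoints/llama-2-13b-chat.nemo \
      --target_file=/path/to/neva/checkpoints/llama-2-13b-chat-tp8.nemo \
      --tensor_model_parallel_size=1 \
      --target_tensor_model_parallel_size=8 \
      --pipeline_model_parallel_size=1 \
      --target_pipeline_model_parallel_size=1 \
      --tp_conversion_only \
      --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
      --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model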
Configure Tokenizer
^^^^^^^^^^^^^^^^^^^

For NeVA training, you must add special tokens to the tokenizer. After obtaining the 7B/13B model from Hugging Face, you also need the corresponding tokenizer model. The steps below use the 7B chat model as an example:
1. Download the `tokenizer.model <https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview/blob/main/tokenizer.model>`_ to:

   .. code-block:: bash

      /path/to/neva/tokenizers/tokenizer.model
2. Step 3 requires the NeMo Framework to be installed. For a quick setup, we recommend running it inside the NeMo Framework container.

3. Run the commands below to add the special tokens to the tokenizer; a sketch for verifying the result follows this list:
   .. code-block:: bash

      cd /opt; git clone https://github.com/google/sentencepiece.git && \
        cd sentencepiece && \
        mkdir build && \
        cd build && \
        cmake .. && \
        make && \
        make install && \
        ldconfig

      cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto

      python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
        --input_file /path/to/neva/tokenizers/tokenizer.model \
        --output_file /path/to/neva/tokenizers/tokenizer_neva.model \
        --is_userdefined \
        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
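As a quick sanity check, you can load the new tokenizer and print the IDs assigned to the added tokens. This is a sketch that assumes the ``sentencepiece`` Python package is available (it ships with the NeMo Framework container).

.. code-block:: bash

    # Sketch: confirm the special tokens exist in the updated tokenizer model.
    python -c "
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor(model_file='/path/to/neva/tokenizers/tokenizer_neva.model')
    for tok in ['<extra_id_0>', '<extra_id_7>']:
        print(tok, sp.piece_to_id(tok))
    "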