Multimodal Language Model Datasets
==================================

The NeMo Framework multimodal language model supports the conversation data format, drawing inspiration from and designed based on `LLaVA `_. Sample datasets can be explored at `LLaVA's data documentation `_.

Prepare the Training Dataset
----------------------------

NeVA model training encompasses two phases: pretraining and fine-tuning. Each phase requires its own dataset.

For **pretraining**, use the *LAION/CC/SBU BLIP-Caption Concept-balanced 558K* dataset. Access this dataset via `LLaVA's GitHub `_. After downloading the dataset, extract it to:

.. code-block:: bash

   /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Acquire the image data from `Hugging Face `__ and extract it to:

.. code-block:: bash

   /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images

For **fine-tuning**, use the *LLaVA-Instruct-150K* dataset. This is also available on `LLaVA's GitHub `_. You can download the prompts from `Hugging Face `__ and place them in:

.. code-block:: bash

   /path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the `COCO Dataset `_. Once downloaded, extract the images to:

.. code-block:: bash

   /path/to/neva/datasets/LLaVA-Instruct-150K/images
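Before launching training, it can be worth confirming that the annotation file and the extracted images line up. The following is a minimal sketch, assuming the LLaVA-style schema in which each JSON record references its picture through an ``image`` field holding a path relative to the images directory; adjust the paths if your layout differs.

.. code-block:: python

   import json
   from pathlib import Path

   # Locations used in this guide; adjust to wherever you extracted the dataset.
   annotations = Path("/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json")
   image_root = Path("/path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images")

   records = json.loads(annotations.read_text())

   # Collect records whose referenced image is not present on disk.
   missing = [
       r["image"]
       for r in records
       if "image" in r and not (image_root / r["image"]).exists()
   ]

   print(f"{len(records)} records, {len(missing)} missing images")
   for path in missing[:10]:
       print("missing:", path)

The same check applies to the fine-tuning split by pointing it at the LLaVA-Instruct-150K JSON file and the extracted COCO images.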
Additional Preparation for the NeVA Model
-----------------------------------------

The following instructions are specific to the NeVA model within the NeMo Framework multimodal language models.

Set Up LLaMA-2 Chat Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Support is available for both the 7B and 13B chat models. Both can be downloaded from `LLaVA's Model Zoo `__. After downloading the checkpoint you want from Hugging Face, extract and store it on your local system to prepare for pretraining.

To convert the LLaMA-2 checkpoints to NeMo's format, follow these steps:

1. Adjust the default YAML file at `megatron_llama_config.yaml `__. Ensure ``model.mcore_gpt`` and ``model.transformer_engine`` are set to ``False`` before the checkpoint conversion.

2. For the 7B chat model, use this conversion command:

   .. code-block:: bash

      python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
        --in-file <path-to-extracted-hf-checkpoint-folder> \
        --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

   For the 13B model, adjust the paths in the ``--in-file`` and ``--out-file`` parameters accordingly.

3. Execute the following command to split the checkpoint for a tensor model parallel size of 4 or 8. It is advisable to use TP=4 for the 7B model and TP=8 for the 13B model so that both pretraining and fine-tuning run without memory complications.

   .. code-block:: bash

      # Instructions for the 7B model partitioning provided here.
      # Adjust parameters for the 13B model as needed.
      python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
        --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo \
        --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
        --tensor_model_parallel_size=1 \
        --target_tensor_model_parallel_size=4 \
        --pipeline_model_parallel_size=1 \
        --target_pipeline_model_parallel_size=1 \
        --tp_conversion_only \
        --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
        --tokenizer_model_path=/path/to/neva/tokenizers/tokenizer.model

Configure Tokenizer
^^^^^^^^^^^^^^^^^^^

For NeVA training, it is essential to add special tokens to the tokenizer. After obtaining the 7B/13B model from Hugging Face, you also need to obtain the corresponding tokenizer model. Referring to the 7B-chat model:

1. Download the `tokenizer.model `_ to:

   .. code-block:: bash

      /path/to/neva/tokenizers/tokenizer.model

2. Step 3 requires NeMo Framework to be installed. For a quick setup, we recommend running it within the NeMo Framework container.

3. Use the command below to add the special tokens to the tokenizer:

   .. code-block:: bash

      cd /opt; git clone https://github.com/google/sentencepiece.git && \
        cd sentencepiece && \
        mkdir build && \
        cd build && \
        cmake .. && \
        make && \
        make install && \
        ldconfig

      cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto

      python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
        --input_file /path/to/neva/tokenizers/tokenizer.model \
        --output_file /path/to/neva/tokenizers/tokenizer_neva.model \
        --is_userdefined \
        --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
                 "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"
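To sanity-check the result, you can compare the original and the augmented tokenizer. The snippet below is a minimal sketch using the ``sentencepiece`` Python package; the paths follow the locations used above, and the pieces it prints should match the tokens you passed to the script.

.. code-block:: python

   import sentencepiece as spm

   # Paths follow the locations used in the steps above.
   base = spm.SentencePieceProcessor(model_file="/path/to/neva/tokenizers/tokenizer.model")
   neva = spm.SentencePieceProcessor(model_file="/path/to/neva/tokenizers/tokenizer_neva.model")

   print(f"vocab grew by {neva.get_piece_size() - base.get_piece_size()} pieces")

   # Newly appended user-defined tokens take the highest IDs, so list the tail of the vocabulary.
   for idx in range(base.get_piece_size(), neva.get_piece_size()):
       print(idx, neva.id_to_piece(idx))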