Multimodal Language Model Datasets
==================================
The NeMo Framework multimodal language model supports the conversation data format, which is inspired by and modeled on `LLaVA <https://github.com/haotian-liu/LLaVA/tree/main>`_. Sample datasets are described in `LLaVA's data documentation <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_.
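
As a rough sketch of this conversation format (the id, filename, and message text below are invented placeholders; see LLaVA's data documentation for authoritative samples), a single record looks like:

.. code-block:: python

    # One record in the LLaVA-style conversation format.
    # "<image>" marks where the image is injected into the prompt.
    sample = {
        "id": "000000033471",         # placeholder id
        "image": "000000033471.jpg",  # placeholder filename
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A city street with parked cars."},
        ],
    }
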
Prepare the Training Dataset
----------------------------
The NeVA model is trained in two phases: pretraining and fine-tuning. Each phase requires its own dataset.
For **pretraining**, use the *LAION/CC/SBU BLIP-Caption Concept-balanced 558K* dataset, available via `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. After downloading, extract the dataset to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Download the image data from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip>`__ and extract it to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images

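
Once both the JSON file and the images are in place, a small script can sanity-check that every caption entry points at an existing image file. This is an illustrative helper, not part of NeMo; the ``image`` field name follows LLaVA's JSON, and the demo below fabricates a tiny one-entry dataset in a temporary directory rather than touching the real paths:

.. code-block:: python

    import json
    import pathlib
    import tempfile

    def check_pretrain_dataset(json_path, image_root):
        """Return (entry_count, missing_image_paths) for a caption JSON."""
        entries = json.loads(pathlib.Path(json_path).read_text())
        root = pathlib.Path(image_root)
        missing = [e["image"] for e in entries
                   if not (root / e["image"]).is_file()]
        return len(entries), missing

    # Demo on a fabricated one-entry dataset.
    with tempfile.TemporaryDirectory() as tmp:
        img_dir = pathlib.Path(tmp) / "images"
        img_dir.mkdir()
        (img_dir / "000000000.jpg").write_bytes(b"\xff\xd8")  # stub JPEG bytes
        json_path = pathlib.Path(tmp) / "blip_sample.json"
        json_path.write_text(json.dumps(
            [{"id": "0", "image": "000000000.jpg", "conversations": []}]))
        total, missing = check_pretrain_dataset(json_path, img_dir)
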
For **fine-tuning**, use the *LLaVA-Instruct-150K* dataset, also available on `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. Download the prompts from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main>`__ to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the `COCO Dataset <https://cocodataset.org/#download>`_. Once downloaded, extract the images to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/images

Additional Preparation for the NeVA Model
-----------------------------------------
The following instructions are specific to the NeVA model within the NeMo Framework multimodal language models.
Set Up LLaMA-2 Chat Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Support is available for both the 7B and 13B chat models. Both can be downloaded from `LLaVA's Model Zoo <https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md>`__. After downloading the checkpoint you want from Hugging Face, extract and store it on your local system to prepare for pretraining.
To convert the LLaMA-2 checkpoints to NeMo's format, follow these steps:
1. Adjust the default YAML file, `megatron_llama_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/language_modeling/conf/megatron_llama_config.yaml>`__. Before converting the checkpoint, ensure ``model.mcore_gpt`` and ``model.transformer_engine`` are both set to ``False``.
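
   The relevant fragment of that YAML, with both flags disabled (other fields omitted for brevity):

   .. code-block:: yaml

       model:
         mcore_gpt: False
         transformer_engine: False
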
2. For the 7B chat model, use this conversion command:

.. code-block:: bash

    python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
      --in-file <PATH-TO-HF-CHECKPOINT> \
      --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

For the 13B model, adjust the ``--in-file`` and ``--out-file`` paths accordingly.
3. Run the following command to partition the checkpoint for tensor model parallelism. Use TP=4 for the 7B model and TP=8 for the 13B model so that both pretraining and fine-tuning run without out-of-memory errors.

.. code-block:: bash

    # Instructions for the 7B model partitioning provided here.
    # Adjust parameters for the 13B model as needed.
    python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
      --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo \
      --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
      --tensor_model_parallel_size=1 \
      --target_tensor_model_parallel_size=4 \
      --pipeline_model_parallel_size=1 \
      --target_pipeline_model_parallel_size=1 \
      --tp_conversion_only \
      --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
      --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model

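
To build intuition for what the partitioning script produces: under tensor model parallelism, each large weight matrix is sharded across the TP ranks. The toy sketch below (pure Python, not NeMo code) shards a matrix column-wise into ``tp_size`` pieces, which is also why the model's hidden dimensions must divide evenly by the chosen TP size:

.. code-block:: python

    # Toy illustration of column-wise tensor-parallel sharding (not NeMo code).
    def shard_columns(matrix, tp_size):
        """Split a 2-D weight (list of rows) into tp_size column shards."""
        cols = len(matrix[0])
        assert cols % tp_size == 0, "width must divide evenly across TP ranks"
        per_rank = cols // tp_size
        return [[row[r * per_rank:(r + 1) * per_rank] for row in matrix]
                for r in range(tp_size)]

    weight = [[float(c) for c in range(8)] for _ in range(2)]  # 2 x 8 toy weight
    shards = shard_columns(weight, tp_size=4)  # four 2 x 2 shards
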
Configure Tokenizer
^^^^^^^^^^^^^^^^^^^
For NeVA training, you must add special tokens to the tokenizer. After obtaining the 7B/13B model from Hugging Face, download the corresponding tokenizer model. Using the 7B-chat model as an example:
1. Download the `tokenizer.model <https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview/blob/main/tokenizer.model>`_ to:

.. code-block:: bash

    /path/to/neva/tokenizers/tokenizer.model

2. The next step requires the NeMo Framework to be installed. For a quick setup, we recommend running it inside the NeMo Framework container.
3. Run the command below to add the special tokens to the tokenizer:

.. code-block:: bash

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
      --input_file /path/to/neva/tokenizers/tokenizer.model \
      --output_file /path/to/neva/tokenizers/tokenizer_neva.model \
      --is_userdefined \
      --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
               "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"

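
The point of registering these as user-defined tokens is that the tokenizer then treats each ``<extra_id_N>`` string as a single atomic piece instead of splitting it into subwords. The pure-Python sketch below only illustrates that behavior (it is not how SentencePiece works internally):

.. code-block:: python

    import re

    SPECIAL_TOKENS = [f"<extra_id_{i}>" for i in range(8)]

    def split_on_special_tokens(text, special_tokens=SPECIAL_TOKENS):
        """Keep each special token as one atomic piece of the split text."""
        pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
        return [piece for piece in re.split(pattern, text) if piece]

    # The special token survives as one piece rather than being subword-split.
    pieces = split_on_special_tokens("Describe <extra_id_0> briefly.")
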