Multimodal Language Model Datasets
==================================
The NeMo Framework multimodal language model supports the conversation data format, which is inspired by and modeled on `LLaVA <https://github.com/haotian-liu/LLaVA/tree/main>`_. Sample datasets are described in `LLaVA's data documentation <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_.
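
As a rough sketch of this conversation format (the id, filename, and message text below are invented placeholders; see LLaVA's data documentation for authoritative samples), a single record looks like:

.. code-block:: python

    # One record in the LLaVA-style conversation format.
    # "<image>" marks where the image is injected into the prompt.
    sample = {
        "id": "000000033471",         # placeholder id
        "image": "000000033471.jpg",  # placeholder filename
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in this image?"},
            {"from": "gpt", "value": "A city street with parked cars."},
        ],
    }
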
Prepare the Training Dataset
----------------------------
The NeVA model is trained in two phases: pretraining and fine-tuning. Each phase requires its own dataset.
For **pretraining**, use the *LAION/CC/SBU BLIP-Caption Concept-balanced 558K* dataset, available via `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. After downloading, extract the dataset to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/blip_laion_cc_sbu_558k.json

Download the image data from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip>`__ and extract it to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Pretrain-LCS-558K/images

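
Once both the JSON file and the images are in place, a small script can sanity-check that every caption entry points at an existing image file. This is an illustrative helper, not part of NeMo; the ``image`` field name follows LLaVA's JSON, and the demo below fabricates a tiny one-entry dataset in a temporary directory rather than touching the real paths:

.. code-block:: python

    import json
    import pathlib
    import tempfile

    def check_pretrain_dataset(json_path, image_root):
        """Return (entry_count, missing_image_paths) for a caption JSON."""
        entries = json.loads(pathlib.Path(json_path).read_text())
        root = pathlib.Path(image_root)
        missing = [e["image"] for e in entries
                   if not (root / e["image"]).is_file()]
        return len(entries), missing

    # Demo on a fabricated one-entry dataset.
    with tempfile.TemporaryDirectory() as tmp:
        img_dir = pathlib.Path(tmp) / "images"
        img_dir.mkdir()
        (img_dir / "000000000.jpg").write_bytes(b"\xff\xd8")  # stub JPEG bytes
        json_path = pathlib.Path(tmp) / "blip_sample.json"
        json_path.write_text(json.dumps(
            [{"id": "0", "image": "000000000.jpg", "conversations": []}]))
        total, missing = check_pretrain_dataset(json_path, img_dir)
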
For **fine-tuning**, use the *LLaVA-Instruct-150K* dataset, also available on `LLaVA's GitHub <https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md>`_. Download the prompts from `Hugging Face <https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main>`__ to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/

Image data for this phase can be obtained from the `COCO Dataset <https://cocodataset.org/#download>`_. Once downloaded, extract the images to:

.. code-block:: bash

    /path/to/neva/datasets/LLaVA-Instruct-150K/images

Additional Preparation for the NeVA Model
-----------------------------------------
The following instructions are specific to the NeVA model within the NeMo Framework multimodal language models.
Set Up LLaMA-2 Chat Checkpoints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Support is available for both the 7B and 13B chat models. Both can be downloaded from `LLaVA's Model Zoo <https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md>`__. After downloading the checkpoint you want from Hugging Face, extract and store it on your local system to prepare for pretraining.
To convert the LLaMA-2 checkpoints to NeMo's format, follow these steps:
1. Adjust the default YAML file, `megatron_llama_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/language_modeling/conf/megatron_llama_config.yaml>`__. Before converting the checkpoint, ensure ``model.mcore_gpt`` and ``model.transformer_engine`` are both set to ``False``.
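
   The relevant fragment of that YAML, with both flags disabled (other fields omitted for brevity):

   .. code-block:: yaml

       model:
         mcore_gpt: False
         transformer_engine: False
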
2. For the 7B chat model, use this conversion command:

.. code-block:: bash

    python /opt/NeMo/scripts/nlp_language_modeling/convert_hf_llama_to_nemo.py \
      --in-file <PATH-TO-HF-CHECKPOINT> \
      --out-file /path/to/neva/checkpoints/llama-2-7b-chat.nemo

For the 13B model, adjust the ``--in-file`` and ``--out-file`` paths accordingly.
3. Run the following command to partition the checkpoint for tensor model parallelism. Use TP=4 for the 7B model and TP=8 for the 13B model so that both pretraining and fine-tuning run without out-of-memory errors.

.. code-block:: bash

    # Instructions for the 7B model partitioning provided here.
    # Adjust parameters for the 13B model as needed.
    python /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py \
      --model_file=/path/to/neva/checkpoints/llama-2-7b-chat.nemo \
      --target_file=/path/to/neva/checkpoints/llama-2-7b-chat-tp4.nemo \
      --tensor_model_parallel_size=1 \
      --target_tensor_model_parallel_size=4 \
      --pipeline_model_parallel_size=1 \
      --target_pipeline_model_parallel_size=1 \
      --tp_conversion_only \
      --model_class="nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel" \
      --tokenizer_model_path=<PATH-TO-HF-CHECKPOINT>/tokenizer.model

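
To build intuition for what the partitioning script produces: under tensor model parallelism, each large weight matrix is sharded across the TP ranks. The toy sketch below (pure Python, not NeMo code) shards a matrix column-wise into ``tp_size`` pieces, which is also why the model's hidden dimensions must divide evenly by the chosen TP size:

.. code-block:: python

    # Toy illustration of column-wise tensor-parallel sharding (not NeMo code).
    def shard_columns(matrix, tp_size):
        """Split a 2-D weight (list of rows) into tp_size column shards."""
        cols = len(matrix[0])
        assert cols % tp_size == 0, "width must divide evenly across TP ranks"
        per_rank = cols // tp_size
        return [[row[r * per_rank:(r + 1) * per_rank] for row in matrix]
                for r in range(tp_size)]

    weight = [[float(c) for c in range(8)] for _ in range(2)]  # 2 x 8 toy weight
    shards = shard_columns(weight, tp_size=4)  # four 2 x 2 shards
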
Configure Tokenizer
^^^^^^^^^^^^^^^^^^^
For NeVA training, you must add special tokens to the tokenizer. After obtaining the 7B/13B model from Hugging Face, download the corresponding tokenizer model. Using the 7B-chat model as an example:
1. Download the `tokenizer.model <https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview/blob/main/tokenizer.model>`_ to:

.. code-block:: bash

    /path/to/neva/tokenizers/tokenizer.model

2. The next step requires the NeMo Framework to be installed. For a quick setup, we recommend running it inside the NeMo Framework container.
3. Run the command below to add the special tokens to the tokenizer:

.. code-block:: bash

    cd /opt; git clone https://github.com/google/sentencepiece.git && \
      cd sentencepiece && \
      mkdir build && \
      cd build && \
      cmake .. && \
      make && \
      make install && \
      ldconfig
    cd /opt/sentencepiece/src/; protoc --python_out=/opt/NeMo/scripts/tokenizers/ sentencepiece_model.proto
    python /opt/NeMo/scripts/tokenizers/add_special_tokens_to_sentencepiece.py \
      --input_file /path/to/neva/tokenizers/tokenizer.model \
      --output_file /path/to/neva/tokenizers/tokenizer_neva.model \
      --is_userdefined \
      --tokens "<extra_id_0>" "<extra_id_1>" "<extra_id_2>" "<extra_id_3>" \
               "<extra_id_4>" "<extra_id_5>" "<extra_id_6>" "<extra_id_7>"

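
The point of registering these as user-defined tokens is that the tokenizer then treats each ``<extra_id_N>`` string as a single atomic piece instead of splitting it into subwords. The pure-Python sketch below only illustrates that behavior (it is not how SentencePiece works internally):

.. code-block:: python

    import re

    SPECIAL_TOKENS = [f"<extra_id_{i}>" for i in range(8)]

    def split_on_special_tokens(text, special_tokens=SPECIAL_TOKENS):
        """Keep each special token as one atomic piece of the split text."""
        pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
        return [piece for piece in re.split(pattern, text) if piece]

    # The special token survives as one piece rather than being subword-split.
    pieces = split_on_special_tokens("Describe <extra_id_0> briefly.")
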