{ "cells": [ { "cell_type": "markdown", "id": "7c9d27250020aba6", "metadata": {}, "source": [ "# Exporting a Llama 3.2 Embedding Model to ONNX and TensorRT\n", "\n", "## Goal\n", "\n", "After [fine-tuning the Llama 3.2 model into an embedding model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb), you need to export the model to ONNX and TensorRT for fast inference. Follow the steps below to generate the ONNX and TensorRT models.\n", "\n", "**Note:** Run the last cell (the Convert the Model to HuggingFace Transformer format section) of the [fine-tuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb) to generate the checkpoint used in this tutorial. Then mount the checkpoint at **/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/** or change the checkpoint path accordingly." ] }, { "cell_type": "markdown", "id": "87846682e01e1a50", "metadata": {}, "source": [ "#### Launch the NeMo Framework container as follows:\n", "\n", "Depending on the number of GPUs available, you may need to adjust the `--gpus` value:\n", "```\n", "docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '\"device=0,1\"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.02\n", "```\n", "\n", "#### Launch Jupyter Notebook as follows:\n", "```\n", "jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "656bf98e-bcce-417e-ba29-cdcce7ec1cba", "metadata": {}, "outputs": [], "source": [ "!pip install onnxruntime-gpu" ] }, { "cell_type": "code", "execution_count": null, "id": "523f0670-319d-4983-b4cc-4e8bd379b29d", "metadata": {}, "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", "import torch\n", "from typing import Literal, Optional, Union\n", "from nemo.collections.llm.gpt.model import 
get_llama_bidirectional_hf_model" ] }, { "cell_type": "code", "execution_count": null, "id": "d12cfd71-225b-4874-9fa9-c45a6d6dc99f", "metadata": {}, "outputs": [], "source": [ "# Paths\n", "hf_model_path = \"/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/\"  # Path to the embedding model checkpoint.\n", "\n", "# HF model parameters\n", "pooling_mode = \"avg\"  # Pooling method used by the embedding model.\n", "normalize = False  # Whether to normalize the output embeddings.\n", "\n", "# ONNX params\n", "opset = 17  # ONNX opset version.\n", "onnx_export_path = \"/opt/checkpoints/llama_embedding_onnx/\"  # Output directory for the ONNX model.\n", "export_dtype = \"fp32\"  # ONNX export data precision.\n", "use_dimension_arg = True  # Whether the model's forward function takes a dimensions argument.\n", "\n", "# TRT params\n", "trt_model_path = Path(\"/opt/checkpoints/llama_embedding_trt/\")  # Output directory for the TensorRT .plan file.\n", "override_layers_to_fp32 = [\"/model/norm/\", \"/pooling_module\", \"/ReduceL2\", \"/Div\"]  # Model-specific layers whose precision is overridden to fp32.\n", "override_layernorm_precision_to_fp32 = True  # Model-specific option: whether to override the layernorm precision to fp32.\n", "profiling_verbosity = \"layer_names_only\"\n", "export_to_trt = True  # Whether to export the ONNX model to TensorRT.\n", "# Whether to build a version-compatible TensorRT engine. This option may result in slower inference.
\n", "# If the TensorRT version used to build the engine matches the version used to run it, set this to False.\n", "# See https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#version-compatibility for more information.\n", "trt_version_compatible = True" ] }, { "cell_type": "code", "execution_count": null, "id": "c539a33a-fea9-4168-a179-c277120767fd", "metadata": {}, "outputs": [], "source": [ "# The base Llama model must be adapted to turn it into an embedding model.\n", "model, tokenizer = get_llama_bidirectional_hf_model(\n", "    model_name_or_path=hf_model_path,\n", "    normalize=normalize,\n", "    pooling_mode=pooling_mode,\n", "    trust_remote_code=True,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "95cd98f4-1cd4-4c0b-8b92-7bb79991de19", "metadata": {}, "outputs": [], "source": [ "from nemo.export.onnx_llm_exporter import OnnxLLMExporter\n", "\n", "# ONNX input names and their dynamic (variable-size) axes.\n", "if use_dimension_arg:\n", "    input_names = [\"input_ids\", \"attention_mask\", \"dimensions\"]\n", "    dynamic_axes_input = {\n", "        \"input_ids\": {0: \"batch_size\", 1: \"seq_length\"},\n", "        \"attention_mask\": {0: \"batch_size\", 1: \"seq_length\"},\n", "        \"dimensions\": {0: \"batch_size\"},\n", "    }\n", "else:\n", "    input_names = [\"input_ids\", \"attention_mask\"]\n", "    dynamic_axes_input = {\n", "        \"input_ids\": {0: \"batch_size\", 1: \"seq_length\"},\n", "        \"attention_mask\": {0: \"batch_size\", 1: \"seq_length\"},\n", "    }\n", "\n", "# ONNX output names and their dynamic axes.\n", "output_names = [\"embeddings\"]\n", "dynamic_axes_output = {\"embeddings\": {0: \"batch_size\", 1: \"embedding_dim\"}}\n", "\n", "onnx_exporter = OnnxLLMExporter(\n", "    onnx_model_dir=onnx_export_path,\n", "    model=model,\n", "    tokenizer=tokenizer,\n", ")\n", "\n", "onnx_exporter.export(\n", "    input_names=input_names,\n", "    output_names=output_names,\n", "    opset=opset,\n", "    dynamic_axes_input=dynamic_axes_input,\n", "    
dynamic_axes_output=dynamic_axes_output,\n", "    export_dtype=export_dtype,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "f1aab9b9-97d0-485c-8d86-dbd21b9a6a33", "metadata": {}, "outputs": [], "source": [ "if export_to_trt:\n", "    # Optimization profile: [min, opt, max] shapes for each input.\n", "    if use_dimension_arg:\n", "        input_profiles = [{\n", "            \"input_ids\": [[1, 3], [16, 128], [64, 256]],\n", "            \"attention_mask\": [[1, 3], [16, 128], [64, 256]],\n", "            \"dimensions\": [[1], [16], [64]],\n", "        }]\n", "    else:\n", "        input_profiles = [{\n", "            \"input_ids\": [[1, 3], [16, 128], [64, 256]],\n", "            \"attention_mask\": [[1, 3], [16, 128], [64, 256]],\n", "        }]\n", "\n", "    trt_builder_flags = None\n", "    if trt_version_compatible:\n", "        import tensorrt as trt\n", "        trt_builder_flags = [trt.BuilderFlag.VERSION_COMPATIBLE]\n", "\n", "    onnx_exporter.export_onnx_to_trt(\n", "        trt_model_dir=trt_model_path,\n", "        profiles=input_profiles,\n", "        override_layernorm_precision_to_fp32=override_layernorm_precision_to_fp32,\n", "        override_layers_to_fp32=override_layers_to_fp32,\n", "        profiling_verbosity=profiling_verbosity,\n", "        trt_builder_flags=trt_builder_flags,\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "id": "051200b7-6eba-44db-b223-059f1dfb60bd", "metadata": {}, "outputs": [], "source": [ "# Run inference on two sample prompts.\n", "prompt = [\"hello\", \"world\"]\n", "dimensions = [2, 4] if use_dimension_arg else None\n", "\n", "onnx_exporter.forward(prompt, dimensions)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }