# Exporting a Llama 3.2 Embedding Model to ONNX and TensorRT

## Goal

Once you have finished [finetuning the Llama 3.2 model into an embedding model](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb), you can export the model to ONNX and TensorRT for fast inference. Follow the steps below, in order, to generate the ONNX and TensorRT models.

**Note:** Run the last cell (the "Convert the Model to HuggingFace Transformer format" section) of the [finetuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/embedding/llama_embedding.ipynb) to generate the checkpoint used in this tutorial. Mount the checkpoint at **/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/**, or change the checkpoint path accordingly.

#### Launch the NeMo Framework container as follows: 

Depending on the number of GPUs available, you may need to adjust the `--gpus` argument:
```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.02
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''
```

In [None]:
!pip install onnxruntime-gpu

In [None]:
import os
from pathlib import Path
import torch
from typing import Literal, Optional, Union
from nemo.collections.llm.gpt.model import get_llama_bidirectional_hf_model

In [None]:
# Paths
hf_model_path = "/opt/checkpoints/llama-3.2-nv-embedqa-1b-v2/" # Path of the embedding model.

# HF model parameters
pooling_mode = "avg" # Pooling method in the embedding model.
normalize = False

# ONNX params
opset = 17 # ONNX opset version.
onnx_export_path = "/opt/checkpoints/llama_embedding_onnx/" # Path for the ONNX file.
export_dtype = "fp32" # ONNX export data precision.
use_dimension_arg = True # Whether the model's forward function takes a `dimensions` argument.

# TRT params
trt_model_path = Path("/opt/checkpoints/llama_embedding_trt/") # Path for the TensorRT .plan file.
override_layers_to_fp32 = ["/model/norm/", "/pooling_module", "/ReduceL2", "/Div", ] # Model specific layers to override the precision to fp32.
override_layernorm_precision_to_fp32 = True # Model-specific option: whether to override the layernorm precision to fp32.
profiling_verbosity = "layer_names_only"
export_to_trt = True # Export ONNX model to TensorRT or not.
# Whether to generate a version-compatible TensorRT engine. This option may result in slower inference.
# If you know the TensorRT versions match (where the engine was generated versus where it is used), set this to False.
# See https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/advanced.html#version-compatibility for more information.
trt_version_compatible = True 
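
Before loading the model, it can help to confirm that the checkpoint directory is actually mounted and that the output directories exist. The following is a minimal sketch that uses only `pathlib`. The helper name `prepare_export_dirs` is hypothetical and not part of NeMo, and the exporter may well create its output directories on its own.

```python
from pathlib import Path


def prepare_export_dirs(checkpoint_dir, output_dirs):
    """Hypothetical helper (not part of NeMo).

    Verifies that the checkpoint directory exists and creates the
    output directories if they are missing.
    """
    checkpoint = Path(checkpoint_dir)
    if not checkpoint.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {checkpoint}")

    created = []
    for out in output_dirs:
        out_path = Path(out)
        # parents=True creates intermediate directories; exist_ok=True makes this idempotent.
        out_path.mkdir(parents=True, exist_ok=True)
        created.append(out_path)
    return created
```

With the variables defined above, a call would look like `prepare_export_dirs(hf_model_path, [onnx_export_path, trt_model_path])`.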

In [None]:
# Base Llama model needs to be adapted to turn it into an embedding model.
model, tokenizer = get_llama_bidirectional_hf_model(
    model_name_or_path=hf_model_path,
    normalize=normalize,
    pooling_mode=pooling_mode,
    trust_remote_code=True,
)
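
To make the `pooling_mode` and `normalize` parameters concrete, here is a minimal pure-Python sketch of masked average pooling followed by optional L2 normalization. It assumes that `"avg"` means the mean of the token vectors where `attention_mask` is 1, which is the usual meaning of average pooling in embedding models. The function names are hypothetical and this is not NeMo's implementation.

```python
import math


def masked_average_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (illustrative only)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm; eps guards against division by zero."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]


# Three token positions; the last one is padding (mask value 0) and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_average_pool(tokens, mask)  # mean of the first two rows
unit = l2_normalize(pooled)  # what normalize=True would add, under the stated assumption
```

With `normalize = False`, as configured above, the pooled vector would be returned without the final scaling step under this assumption.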

In [None]:
from nemo.export.onnx_llm_exporter import OnnxLLMExporter

if use_dimension_arg:
    # ONNX specific arguments, input names in this case.
    input_names = ["input_ids", "attention_mask", "dimensions"]
    dynamic_axes_input = {
        "input_ids": {0: "batch_size", 1: "seq_length"},
        "attention_mask": {0: "batch_size", 1: "seq_length"},
        "dimensions": {0: "batch_size"},
    }
else:
    input_names = ["input_ids", "attention_mask"]
    dynamic_axes_input = {
        "input_ids": {0: "batch_size", 1: "seq_length"},
        "attention_mask": {0: "batch_size", 1: "seq_length"},
    }

output_names = ["embeddings"] # ONNX specific arguments, output names in this case.
dynamic_axes_output = {"embeddings": {0: "batch_size", 1: "embedding_dim"}}

onnx_exporter = OnnxLLMExporter(
    onnx_model_dir=onnx_export_path,
    model=model,
    tokenizer=tokenizer,
)

onnx_exporter.export(
    input_names=input_names,
    output_names=output_names,
    opset=opset,
    dynamic_axes_input=dynamic_axes_input,
    dynamic_axes_output=dynamic_axes_output,
    export_dtype=export_dtype,  # Use the precision configured above instead of a hardcoded value.
)

In [None]:
if export_to_trt:
    if use_dimension_arg:
        input_profiles = [{
            "input_ids": [[1, 3], [16, 128], [64, 256]],
            "attention_mask": [[1, 3], [16, 128], [64, 256]],
            "dimensions": [[1], [16], [64]],
        }]
    else:
        input_profiles = [{
            "input_ids": [[1, 3], [16, 128], [64, 256]],
            "attention_mask": [[1, 3], [16, 128], [64, 256]],
        }]

    trt_builder_flags = None
    if trt_version_compatible:
        import tensorrt as trt
        trt_builder_flags = [trt.BuilderFlag.VERSION_COMPATIBLE]

    onnx_exporter.export_onnx_to_trt(
        trt_model_dir=trt_model_path,
        profiles=input_profiles,
        override_layernorm_precision_to_fp32=override_layernorm_precision_to_fp32,
        override_layers_to_fp32=override_layers_to_fp32,
        profiling_verbosity=profiling_verbosity,
        trt_builder_flags=trt_builder_flags,
    )
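
Each entry in `input_profiles` lists three shapes per input. This sketch assumes they are the minimum, optimum, and maximum shapes of a TensorRT optimization profile, in that order. Under that assumption, a quick sanity check before a long engine build is that every dimension satisfies min <= opt <= max. The helper name `check_profile` is hypothetical and not part of NeMo or TensorRT.

```python
def check_profile(profile):
    """Hypothetical helper: check min <= opt <= max per dimension for each input.

    Assumes each value has the form [min_shape, opt_shape, max_shape].
    """
    for name, shapes in profile.items():
        if len(shapes) != 3:
            raise ValueError(f"{name}: expected [min, opt, max], got {len(shapes)} shapes")
        min_shape, opt_shape, max_shape = shapes
        if not (len(min_shape) == len(opt_shape) == len(max_shape)):
            raise ValueError(f"{name}: shapes have different ranks")
        for lo, mid, hi in zip(min_shape, opt_shape, max_shape):
            if not (lo <= mid <= hi):
                raise ValueError(f"{name}: expected min <= opt <= max, got {lo}, {mid}, {hi}")
    return True


# Same values as the profile used in this tutorial.
profile = {
    "input_ids": [[1, 3], [16, 128], [64, 256]],
    "attention_mask": [[1, 3], [16, 128], [64, 256]],
    "dimensions": [[1], [16], [64]],
}
profile_ok = check_profile(profile)
```

Under the same assumption, inputs outside this range, such as a batch larger than 64 or a sequence longer than 256 tokens, would not fit the profile.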

In [None]:
prompt = ["hello", "world"]
dimensions = [2, 4] if use_dimension_arg else None

onnx_exporter.forward(prompt, dimensions)
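
One common way to use the resulting embeddings is to compare them with cosine similarity. The sketch below assumes each embedding can be converted to a flat list of floats. The function name is hypothetical, and the vectors are made-up values for illustration, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative only)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up two-dimensional vectors standing in for embeddings.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors that point in the same direction score close to 1, and orthogonal vectors score 0, regardless of their magnitudes.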