Llama 3.1 Pruning and Distillation with NeMo 2.0 Framework
Llama 3.1 models, developed by Meta, are open-source large language models that deliver state-of-the-art performance on popular industry benchmarks. Pretrained on over 15 trillion tokens, they support a 128K token context length. These models are available in three sizes: 8B, 70B, and 405B. Each size offers two variants: base pretrained and instruction tuned.
NVIDIA NeMo Framework provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 to fit your use case.
NVIDIA TensorRT Model Optimizer is a library (referred to as Model Optimizer, or ModelOpt) comprising state-of-the-art model optimization techniques including quantization, distillation, pruning, and speculative decoding to compress models.
LLM Pruning and Distillation in Practice: The Minitron Approach is the tech report that details teacher fine-tuning, pruning, and distillation on Llama 3.1.
How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model provides practical and effective structured compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining.
Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy introduces the Mistral-NeMo-Minitron 8B, a state-of-the-art 8 billion parameter language model created by pruning and distilling the larger Mistral NeMo 12B model.
Objectives
This tutorial demonstrates how to perform depth-pruning, width-pruning, teacher fine-tuning, and distillation on Llama 3.1 8B using the WikiText-103-v1 dataset with the NeMo Framework. We will start with a Hugging Face checkpoint, convert it to NeMo format for pruning and distillation, and later convert the distilled model back to Hugging Face format. The WikiText-103-v1 language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the Meta Llama 3.1 8B teacher model to generate a fine-tuned teacher model, which is needed for optimal distillation. This fine-tuned teacher model is then pruned. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding two pruned models. These models will serve as starting points for distillation to create the final distilled 4B models.
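As a rough illustration of the two pruning paths (this is back-of-the-envelope arithmetic, not the NeMo implementation): depth-pruning removes entire transformer layers, while width-pruning shrinks the hidden and MLP dimensions of every layer. The layer counts and dimensions below are approximations of Llama 3.1 8B-style hyperparameters, chosen only to show how both paths can reach roughly half the original parameter count.

```python
# Back-of-the-envelope transformer parameter count (illustrative only;
# ignores embeddings, norms, and attention-head bookkeeping).
def layer_params(hidden, ffn):
    attn = 4 * hidden * hidden   # Q, K, V, O projections (rough)
    mlp = 3 * hidden * ffn       # gate, up, down projections (SwiGLU-style)
    return attn + mlp

def model_params(layers, hidden, ffn):
    return layers * layer_params(hidden, ffn)

# Approximate Llama 3.1 8B-style body (assumed dims, not exact).
base = model_params(layers=32, hidden=4096, ffn=14336)

# Depth-pruning: keep the same width but drop half of the layers.
depth_pruned = model_params(layers=16, hidden=4096, ffn=14336)

# Width-pruning: keep all layers but shrink hidden/FFN dimensions.
width_pruned = model_params(layers=32, hidden=3072, ffn=9216)

print(f"base         ~ {base / 1e9:.1f}B")          # ~ 7.8B
print(f"depth-pruned ~ {depth_pruned / 1e9:.1f}B")  # ~ 3.9B
print(f"width-pruned ~ {width_pruned / 1e9:.1f}B")  # ~ 3.9B
```

Either path lands near 4B parameters; the later distillation step then recovers accuracy lost to pruning.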
NOTE: The notebooks demonstrate a subset of the available functionality. Some features, such as Neural Architecture Search (NAS), are currently unavailable but will be supported in future releases.
Requirements
System Configuration
- Access to at least 8 NVIDIA GPUs, each with a memory of at least 80GB (e.g., 8 x H100-80GB or 8 x A100-80GB).
- A Docker-enabled environment, with NVIDIA Container Runtime installed, which will make the container GPU-aware.
Get your Hugging Face access token, which will be used to download the Llama 3.1 model and tokenizer.
NOTE: The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs. However, you can potentially reduce the Tensor Parallel size (TENSOR_PARALLEL_SIZE) along with the Micro-Batch size (MICRO_BATCH_SIZE) in the teacher fine-tuning and distillation scripts to accommodate lower resource availability.
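To see why these two knobs trade off, recall the usual Megatron-style relationship: GPUs are divided among tensor, pipeline, and data parallelism, and the global batch is the micro-batch times the data-parallel size times the gradient-accumulation steps. The helper functions below are an illustrative sketch of that arithmetic, not the actual logic of the NeMo scripts; only TENSOR_PARALLEL_SIZE and MICRO_BATCH_SIZE are names from the notebook.

```python
# Illustrative Megatron-style parallelism arithmetic (not NeMo's code).
def data_parallel_size(num_gpus, tensor_parallel, pipeline_parallel=1):
    # GPUs are first split across tensor/pipeline groups; the rest is data parallel.
    assert num_gpus % (tensor_parallel * pipeline_parallel) == 0
    return num_gpus // (tensor_parallel * pipeline_parallel)

def grad_accum_steps(global_batch, micro_batch, dp_size):
    # Steps of gradient accumulation needed to reach the global batch.
    assert global_batch % (micro_batch * dp_size) == 0
    return global_batch // (micro_batch * dp_size)

# Default-style setup: 8 GPUs with TP=8 leaves DP=1.
dp = data_parallel_size(num_gpus=8, tensor_parallel=8)
print(dp, grad_accum_steps(global_batch=128, micro_batch=4, dp_size=dp))

# Fewer GPUs: 4 GPUs with TP=4 still gives DP=1; halving the micro-batch
# doubles the accumulation steps but preserves the same global batch.
dp = data_parallel_size(num_gpus=4, tensor_parallel=4)
print(dp, grad_accum_steps(global_batch=128, micro_batch=2, dp_size=dp))
```

The takeaway: shrinking TENSOR_PARALLEL_SIZE and MICRO_BATCH_SIZE lowers per-GPU memory pressure, while the global batch (and thus training behavior) can stay fixed via gradient accumulation.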
Create a Pruned and Distilled Model with NeMo Framework
For pruning and distilling the model, you will use the NeMo Framework, which is available as a Docker container. These notebooks have been tested with the nvcr.io/nvidia/nemo:25.04 container.
- Run the container using the following command. You will mount your local directory to /workspace so that the model and dataset are stored in a persistent location. If you are using your own model and dataset, change the paths in the notebooks accordingly.
export FW_VERSION=25.04
docker run \
--gpus all \
--shm-size=16GB \
--net=host \
--ulimit memlock=-1 \
--rm -it \
-v ${PWD}:/workspace \
-w /workspace \
nvcr.io/nvidia/nemo:$FW_VERSION bash
- From within the container, copy the notebooks to your local directory so that changes remain persistent (only needed the first time you run the container).
cp -r /opt/NeMo/tutorials/llm/llama/pruning-distillation/* /workspace
- From within the container, log in with your Hugging Face token to download the Llama 3.1 model and tokenizer (not required if you have already downloaded them).
huggingface-cli login --token <YOUR_HF_ACCESS_TOKEN>
- Start Jupyter Lab:
pip install --upgrade ipywidgets notebook
jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
- Then, navigate to this directory, which contains the notebooks covering all the steps needed to create a distilled 4B model.
This workflow is structured into four notebooks:
- Prepare the model and dataset
- Fine-tune the teacher on the dataset
- Prune the fine-tuned teacher model to create a student via either depth-pruning or width-pruning
- Distill knowledge from teacher into student
NOTE: We explore two methods of pruning the fine-tuned teacher model: depth-pruning and width-pruning. Per the tech report, width-pruning generally achieves better accuracy, while the depth-pruned model is generally faster at inference. You can therefore choose to perform depth-pruning, width-pruning, or both.