Llama 3.1 Pruning and Distillation with NeMo 2.0 Framework

Llama 3.1 models, developed by Meta, are open-source large language models that deliver state-of-the-art performance on popular industry benchmarks. Pretrained on over 15 trillion tokens, they support a 128K token context length. These models are available in three sizes: 8B, 70B, and 405B. Each size comes in two variants: a base pretrained model and an instruction-tuned model.

NVIDIA NeMo Framework provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 to fit your use case.

NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library of state-of-the-art model optimization techniques, including quantization, distillation, pruning, and speculative decoding, for compressing models.

The tech report LLM Pruning and Distillation in Practice: The Minitron Approach provides details on teacher fine-tuning, pruning, and distillation for Llama 3.1.

How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model provides practical and effective structured compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining.

Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy introduces the Mistral-NeMo-Minitron 8B, a state-of-the-art 8 billion parameter language model created by pruning and distilling the larger Mistral NeMo 12B model.

Objectives

This tutorial demonstrates how to perform depth-pruning, width-pruning, teacher fine-tuning, and distillation on Llama 3.1 8B using the WikiText-103-v1 dataset with the NeMo Framework. We will start with a Hugging Face checkpoint, convert it to NeMo format for pruning and distillation, and later convert the distilled model back to Hugging Face format. The WikiText-103-v1 language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.
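
For orientation, the checkpoint conversion in the first notebook relies on the NeMo 2.0 llm collection. Below is a minimal sketch of the round trip, assuming the import_ckpt/export_ckpt APIs shipped in the container; the paths are illustrative and the notebook itself is the source of truth.

from pathlib import Path
from nemo.collections import llm

# Import the Hugging Face checkpoint into NeMo 2.0 format.
# The "hf://" scheme pulls weights from the Hugging Face Hub.
llm.import_ckpt(
    model=llm.LlamaModel(config=llm.Llama31Config8B()),
    source="hf://meta-llama/Meta-Llama-3.1-8B",
)

# After distillation, export the NeMo checkpoint back to Hugging Face format.
llm.export_ckpt(
    path=Path("/workspace/results/distilled-4b"),  # illustrative checkpoint path
    target="hf",
    output_path=Path("/workspace/results/distilled-4b-hf"),
)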

For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the Meta Llama 3.1 8B teacher model to generate a fine-tuned teacher model, which is needed for optimal distillation. This fine-tuned teacher model is then pruned. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding two pruned models. These models will serve as starting points for distillation to create the final distilled 4B models.
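
To build intuition for the two pruning strategies before opening the notebooks, here is a toy PyTorch sketch. This is illustrative only: the actual pruning in NeMo is driven by ModelOpt and selects what to remove via importance ranking, not naive truncation as below.

import torch.nn as nn

def depth_prune(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    # Depth-pruning: drop whole transformer layers, keeping only the
    # indices whose removal hurts validation loss the least.
    return nn.ModuleList(layers[i] for i in keep)

def width_prune_mlp(mlp: nn.Sequential, new_hidden: int) -> nn.Sequential:
    # Width-pruning (toy): shrink the MLP hidden dimension. Real width-
    # pruning also trims attention heads and embedding channels, and
    # picks which neurons to keep by an importance score.
    up, act, down = mlp  # Linear(d, h), activation, Linear(h, d)
    new_up = nn.Linear(up.in_features, new_hidden)
    new_up.weight.data = up.weight.data[:new_hidden].clone()
    new_up.bias.data = up.bias.data[:new_hidden].clone()
    new_down = nn.Linear(new_hidden, down.out_features)
    new_down.weight.data = down.weight.data[:, :new_hidden].clone()
    new_down.bias.data = down.bias.data.clone()
    return nn.Sequential(new_up, act, new_down)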

NOTE: Only a subset of the framework's functionality is demonstrated in these notebooks. Some features, such as Neural Architecture Search (NAS), are unavailable but will be supported in future releases.

Requirements

System Configuration

  • Access to at least 8 NVIDIA GPUs, each with at least 80 GB of memory (e.g., 8 x H100-80GB or 8 x A100-80GB).
  • A Docker-enabled environment, with NVIDIA Container Runtime installed, which will make the container GPU-aware.

Get your Hugging Face access token, which will be used to download the Llama 3.1 model and tokenizer.

NOTE: The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs. However, you can potentially reduce the tensor-parallel size (TENSOR_PARALLEL_SIZE) along with the micro-batch size (MICRO_BATCH_SIZE) in the teacher fine-tuning and distillation scripts to accommodate lower resource availability.
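
For example, on a node with four 80GB GPUs you might set something like the following in the relevant notebook cells. The variable names mirror those used in the notebooks; the values that actually fit depend on your hardware.

# Hypothetical settings for a 4 x 80GB node; the notebook defaults target 8 GPUs.
TENSOR_PARALLEL_SIZE = 4  # must evenly divide the GPU count and attention-head count
MICRO_BATCH_SIZE = 2      # reduce further if you still hit out-of-memory errors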

Create a Pruned and Distilled Model with NeMo Framework

For pruning and distilling the model, you will use the NeMo Framework, which is available as a Docker container. These notebooks have been tested on the nvcr.io/nvidia/nemo:25.04 container.

  1. Run the container using the following command. You will mount your local directory to /workspace so the model and dataset will be stored in a persistent location. If you are using your own model and dataset, you can change the paths in the notebooks accordingly.
export FW_VERSION=25.04
docker run \
  --gpus all \
  --shm-size=16GB \
  --net=host \
  --ulimit memlock=-1 \
  --rm -it \
  -v ${PWD}:/workspace \
  -w /workspace \
  nvcr.io/nvidia/nemo:$FW_VERSION bash
  2. From within the container, copy the notebooks to your local directory so that your changes persist (only needed the first time you run the container).
cp -r /opt/NeMo/tutorials/llm/llama/pruning-distillation/* /workspace
  3. From within the container, log in with your Hugging Face token to download the Llama 3.1 model and tokenizer (not required if you have already downloaded them).
huggingface-cli login --token <YOUR_HF_ACCESS_TOKEN>
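Alternatively, you can authenticate programmatically from a notebook cell using the huggingface_hub Python library:
from huggingface_hub import login

# Programmatic equivalent of `huggingface-cli login`.
login(token="<YOUR_HF_ACCESS_TOKEN>")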
  4. Start Jupyter Lab:
pip install --upgrade ipywidgets notebook
jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
  5. Then, navigate to this directory, which contains the notebooks covering all the steps to create a distilled 4B model.

This workflow is structured into four notebooks:

  1. Prepare the model and dataset
  2. Fine-tune the teacher on the dataset
  3. Prune the fine-tuned teacher model to create a student via either depth-pruning or width-pruning
  4. Distill knowledge from the teacher into the student (a conceptual sketch of the distillation loss follows this list)
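
As background for step 4: logit distillation trains the student to match the teacher's next-token distribution. Here is a conceptual PyTorch sketch of the core loss term; NeMo and ModelOpt handle this internally, together with model parallelism and any intermediate-state losses.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Forward KL divergence between teacher and student token distributions.
    # Logits have shape (batch, seq_len, vocab); we average over all tokens.
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2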

NOTE: We explore two methods to prune the fine-tuned teacher model: depth-pruning and width-pruning. Per the tech report, width-pruning generally yields better accuracy, while the depth-pruned model is generally faster at inference, so you can choose to perform depth-pruning, width-pruning, or both.