File size: 6,695 Bytes
b386992
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
# Llama 3.1 Pruning and Distillation with NeMo 2.0 Framework

[Llama 3.1](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/) models, developed by Meta, are open-source large language models that deliver state-of-the-art performance on popular industry benchmarks. Pretrained on over 15 trillion tokens, they support a 128K token context length. These models are available in three sizes: 8B, 70B, and 405B. Each size offers two variants: base pretrained and instruction tuned.

[NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) provides tools to perform teacher fine-tuning, pruning, and distillation on Llama 3.1 to fit your use case.

[NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library (referred to as **Model Optimizer**, or **ModelOpt**) comprising state-of-the-art model optimization techniques including [quantization](https://github.com/NVIDIA/TensorRT-Model-Optimizer#quantization-examples-docs), [distillation](https://github.com/NVIDIA/TensorRT-Model-Optimizer#distillation-examples-docs), [pruning](https://github.com/NVIDIA/TensorRT-Model-Optimizer#pruning-examples-docs), and [speculative decoding](https://github.com/NVIDIA/TensorRT-Model-Optimizer#speculative-decoding-examples-docs) to compress models.

[LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796) provides details on teacher fine-tuning, pruning, and distillation on Llama 3.1 as described in the [tech report](https://arxiv.org/abs/2408.11796).

[How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/) provides practical and effective structured compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining.

[Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy](https://developer.nvidia.com/blog/mistral-nemo-minitron-8b-foundation-model-delivers-unparalleled-accuracy/) introduces the Mistral-NeMo-Minitron 8B, a state-of-the-art 8 billion parameter language model created by pruning and distilling the larger Mistral NeMo 12B model.

## Objectives

This tutorial demonstrates how to perform depth-pruning, width-pruning, teacher fine-tuning, and distillation on **Llama 3.1 8B** using the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset with the NeMo Framework. We will start with a HuggingFace checkpoint and convert it to NeMo format to use for pruning and distillation and later convert the distilled model back to HuggingFace format. The `WikiText-103-v1` language modeling dataset comprises over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.

For this demonstration, we will perform teacher correction by running a light fine-tuning procedure on the `Meta Llama 3.1 8B` teacher model to generate a fine-tuned teacher model, needed for optimal distillation. This fine-tuned teacher model is then trimmed. There are two methods to prune a model: depth-pruning and width-pruning. We will explore both techniques, yielding 2 pruned models. These models will serve as starting points for distillation to create the final distilled 4B models.

**NOTE:** A subset of functions is being demonstrated in the notebooks. Some features like Neural Architecture Search (NAS) are unavailable, but will be supported in future releases.

## Requirements

### System Configuration
- Access to at least 8 NVIDIA GPUs, each with a memory of at least 80GB (e.g., 8 x H100-80GB or 8 x A100-80GB).
- A Docker-enabled environment, with [NVIDIA Container Runtime](https://developer.nvidia.com/container-runtime) installed, which will make the container GPU-aware.

Get your Hugging Face [access token](https://huggingface.co/docs/hub/en/security-tokens), which will be used to download the Llama 3.1 model and tokenizer.

**NOTE:** The default configuration in the notebook runs on 8 x 80GB NVIDIA GPUs. However, you can potentially reduce the Tensor Parallel size (`TENSOR_PARALLEL_SIZE`) along with the Micro-Batchsize (`MICRO_BATCH_SIZE`) in the teacher fine-tuning and distillation scripts to accommodate lower resource availability.

## Create a Pruned and Distilled Model with NeMo Framework

For pruning and distilling the model, you will use the NeMo Framework, which is available as a [Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo). These notebooks has been tested on `nvcr.io/nvidia/nemo:25.04` container.

1. Run the container using the following command. You will mount your local directory to `/workspace` so the model and dataset will be stored in a persistent location. If you are using your own model and dataset, you can change the paths in the notebooks accordingly.

```bash
export FW_VERSION=25.04
```

```bash
docker run \
  --gpus all \
  --shm-size=16GB \
  --net=host \
  --ulimit memlock=-1 \
  --rm -it \
  -v ${PWD}:/workspace \
  -w /workspace \
  nvcr.io/nvidia/nemo:$FW_VERSION bash
```

2. From within the container, copy the notebooks to your local directory so changes remain persistent (only if running first time).

```bash
cp -r /opt/NeMo/tutorials/llm/llama/pruning-distillation/* /workspace
```

3. From within the container, login with your Hugging Face token to download the Llama 3.1 model and tokenizer (not required if you have already downloaded the model and tokenizer).

```bash
huggingface-cli login --token <YOUR_HF_ACCESS_TOKEN>
```

4. Start the Jupyter lab:

```bash
pip install --upgrade ipywidgets notebook
jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
```

5. Then, navigate to this directory which contains a list of notebooks that cover all the steps to create a distilled 4B model.

This workflow is structured into four notebooks:
  1. [Prepare the model and dataset](./01_model_and_data_preparation.ipynb)
  2. [Fine-tune the teacher on the dataset](./02_teacher_finetuning.ipynb)
  3. [Prune the fine-tuned teacher model](./03_pruning.ipynb) to create a student via either depth-pruning or width-pruning
  4. [Distill knowledge from teacher into student](./04_distillation.ipynb)

> `NOTE:` We are exploring two methods to prune the fine-tuned teacher model: depth-pruning and width-pruning. Per the [tech report](https://arxiv.org/pdf/2408.11796), we can observe that width-pruning generally outperforms depth-pruning while depth pruned model is generally faster so users can choose to perform either depth-pruning or width-pruning or both methods simultaneously.