Oleg Lavrovsky
committed on
Initial testing
Files changed:

- README.md +185 -3
- accelerate_config/fsdp2.yaml +25 -0
- distill.py +78 -0
- hello.py +6 -0
- main.py +160 -0
- pyproject.toml +9 -0
- requirements.txt +4 -0
- run.sh +10 -0
- uv.lock +0 -0
README.md
CHANGED
# Knowledge Distillation

Knowledge Distillation is a machine learning technique where a compact "student" model learns to replicate the behavior of a larger, more complex "teacher" model to achieve comparable performance with improved efficiency.

Model Optimizer's Distillation is a set of wrappers and utilities to easily perform Knowledge Distillation between teacher and student models. Given a pretrained teacher model, Distillation has the potential to train a smaller student model faster and/or with higher accuracy than the student model could achieve on its own.

This section demonstrates how to apply Model Optimizer to perform knowledge distillation with ease.
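At its core, logit distillation compares temperature-softened output distributions of student and teacher. The following is a minimal, framework-free sketch of that objective, not ModelOpt's implementation; the function names and the temperature value are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T gives softer targets."""
    exps = [x / temperature for x in logits]
    m = max(exps)  # subtract max for numerical stability
    exps = [math.exp(e - m) for e in exps]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature * temperature * kl

# Identical logits give zero loss; diverging logits give a positive loss
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0)  # True
```

ModelOpt's loss criteria (see the Support Matrix below) implement this kind of objective in PyTorch with batching and reduction handled for you.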
<div align="center">

| **Section** | **Description** | **Link** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to optimize your models using distillation to produce smaller, more intelligent models | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Support Matrix | View the support matrix to see compatibility and feature availability across different models | \[[Link](#support-matrix)\] | |
| Distillation with Megatron-LM | Learn how to distill your models with the Megatron-LM framework | \[[Link](#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\] | |
| Distillation with NeMo | Learn how to distill your models with the NeMo framework | \[[Link](#knowledge-distillation-kd-in-nvidia-nemo-framework)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Distillation with Huggingface | Learn how to distill your models with Hugging Face | \[[Link](#knowledge-distillation-kd-for-huggingface-models)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\] |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | \[[Link](../nemo_run/prune_distill/README.md)\] | |

</div>

## Pre-Requisites

### Docker

For Hugging Face models, please use the PyTorch docker image (e.g., `nvcr.io/nvidia/pytorch:25.06-py3`).
For NeMo models, use the NeMo container (e.g., `nvcr.io/nvidia/nemo:25.09`), which has all the dependencies installed.
Visit our [installation docs](https://nvidia.github.io/Model-Optimizer/getting_started/2_installation.html) for more information.

Also follow the installation steps below to upgrade to the latest version of Model Optimizer and install example-specific dependencies.

### Local Installation

For Hugging Face models, install Model Optimizer with `hf` dependencies using `pip` from [PyPI](https://pypi.org/project/nvidia-modelopt/) and install the requirements for the example:

```bash
pip install -U "nvidia-modelopt[hf]"
pip install -r requirements.txt
```

## Getting Started

### Set up your base models

First, obtain both a pretrained model to act as the teacher and a (usually smaller) model to serve as the student.

```python
from transformers import AutoModelForCausalLM

# Define student & teacher
student_model = AutoModelForCausalLM.from_pretrained("student-model-id-or-path")
teacher_model = AutoModelForCausalLM.from_pretrained("teacher-model-id-or-path")
```

### Set up the meta model

As Knowledge Distillation involves (at least) two models, ModelOpt simplifies the integration process by wrapping both student and teacher into one meta model.

An example Distillation setup is shown below. It assumes the outputs of `teacher_model` and `student_model` are logits.

```python
import modelopt.torch.distill as mtd

distillation_config = {
    "teacher_model": teacher_model,
    "criterion": mtd.LogitsDistillationLoss(),  # callable receiving student and teacher outputs, in order
    "loss_balancer": mtd.StaticLossBalancer(),  # combines multiple losses; omit if only one distillation loss is used
}

distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
```

The `teacher_model` can be either an `nn.Module`, a callable which returns an `nn.Module`, or a tuple of `(model_cls, args, kwargs)`. The `criterion` is the distillation loss used between student and teacher tensors. The `loss_balancer` determines how the original and distillation losses are combined (if needed).

See [Distillation](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html) for more info.

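Conceptually, the meta model runs the teacher alongside the student on each forward pass and keeps both outputs around for the criterion. A toy, framework-free sketch of that idea (hypothetical class and callables, not ModelOpt's actual API):

```python
class KDMetaModel:
    """Toy meta model: wraps student and teacher and records both outputs
    so a distillation criterion can consume them later."""

    def __init__(self, student, teacher, criterion):
        self.student = student
        self.teacher = teacher
        self.criterion = criterion
        self._outputs = None

    def __call__(self, x):
        s_out = self.student(x)
        t_out = self.teacher(x)  # teacher runs alongside, no extra user code
        self._outputs = (s_out, t_out)
        return s_out  # the caller only sees the student output

    def compute_kd_loss(self, student_loss=0.0):
        return student_loss + self.criterion(*self._outputs)


# Toy "models" and an absolute-difference "criterion"
meta = KDMetaModel(lambda x: 2 * x, lambda x: 3 * x, lambda s, t: abs(s - t))
out = meta(5)                      # student output: 10
total = meta.compute_kd_loss(1.0)  # 1.0 + |10 - 15| = 6.0
print(out, total)  # 10 6.0
```

The real `mtd.convert` additionally preserves the student's module hierarchy and handles device placement, serialization, and the loss balancer.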
### Distill during training

To distill from teacher to student, simply use the meta model in the usual training loop, and call the meta model's `.compute_kd_loss()` method to compute the distillation loss in addition to the original user loss.

An example of Distillation training is given below:

```python
# Set up the data loader. For example:
train_dataloader = get_train_loader()

# Define the user loss function. For example:
loss_fn = get_user_loss_fn()

for inputs, labels in train_dataloader:
    distillation_model.zero_grad()
    # Forward through the wrapped models
    out = distillation_model(inputs)
    # Same loss as originally present
    loss = loss_fn(out, labels)
    # Combine distillation and user losses
    loss_total = distillation_model.compute_kd_loss(student_loss=loss)
    loss_total.backward()
```

> [!NOTE]
> DataParallel may break ModelOpt's Distillation feature. Note that the Hugging Face `Trainer` uses DataParallel by default.

### Export trained model

The model can easily be reverted to its original class for further use (i.e., deployment) without any ModelOpt modifications attached.

```python
model = mtd.export(distillation_model)
```

## Support Matrix

### Components available out of the box

Loss criteria:

- `mtd.LogitsDistillationLoss()` - Standard KL-divergence on output logits
- `mtd.MGDLoss()` - Masked Generative Distillation loss for 2D convolutional outputs
- `mtd.MFTLoss()` - KL-divergence loss with Minifinetuning threshold modification

Loss balancers:

- `mtd.StaticLossBalancer()` - Combines the original student loss and KD loss into a single weighted sum (with weights that do not change over time)

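A static balancer amounts to a fixed weighted sum of the two losses. A one-line illustrative sketch (the function name and the `kd_weight` values are hypothetical, not ModelOpt defaults):

```python
def static_loss_balance(student_loss, kd_loss, kd_weight=0.5):
    """Fixed weighted sum of the original student loss and the KD loss;
    the weight stays constant for the whole training run."""
    return (1.0 - kd_weight) * student_loss + kd_weight * kd_loss

print(static_loss_balance(2.0, 4.0, kd_weight=0.25))  # 2.5
```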
### Supported Models

> [!NOTE]
> The following models have been confirmed to run with ModelOpt distillation, but support is by no means limited to them.

| Model | Type | Confirmed compatible |
| :---: | :---: | :---: |
| Nemotron | gpt | ✅ |
| Llama 3 | llama | ✅ |
| Llama 4 | llama | ✅ |
| Gemma 2 | gemma | ✅ |
| Gemma 3 | gemma | ✅ |
| Phi 3 | phi | ✅ |
| Qwen 2 | qwen2 | ✅ |
| Qwen 3 | qwen3 | ✅ |
| Mamba | mamba | ✅ |

## Knowledge Distillation (KD) in NVIDIA Megatron-LM Framework

Check out the Knowledge Distillation example in the [Megatron-LM repository](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).

## Knowledge Distillation (KD) in NVIDIA NeMo Framework

Check out the stand-alone distillation script in the [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).

You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation), which showcase the usage of Minitron pruning followed by distillation for Qwen 3 8B, step by step, in the NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently, as shown in the tutorial.

## Knowledge Distillation (KD) for HuggingFace Models

In this end-to-end example we finetune Llama-3.2 models on the [smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) dataset as a minimal example demonstrating a simple way of integrating Model Optimizer's KD feature.

We replace normal supervised finetuning (SFT) of a Llama-3.2-1B base model with distillation from Llama-3.2-3B-Instruct, which has already been instruction-finetuned.

> [!NOTE]
> We can fit the following in memory using [FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp) on 8x RTX 6000 GPUs (~400 GB total VRAM):

```bash
accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'meta-llama/Llama-3.2-3B-Instruct' \
    --student_name_or_path 'meta-llama/Llama-3.2-1B' \
    --output_dir ./llama3.2-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```

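As a rough sanity check on the memory note above, the model weights alone are small relative to the ~400 GB available; gradients, optimizer states, and activations account for most of the rest. A back-of-envelope sketch, assuming bf16 weights (2 bytes per parameter) and nominal parameter counts:

```python
# Back-of-envelope VRAM for weights only; gradients, AdamW optimizer
# states, and activations add substantially more on top of this.
GB = 1024 ** 3
teacher_params = 3e9  # Llama-3.2-3B-Instruct (nominal)
student_params = 1e9  # Llama-3.2-1B (nominal)

weights_gb = 2 * (teacher_params + student_params) / GB  # 2 bytes/param in bf16
print(round(weights_gb, 1))  # 7.5
```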
## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)
- 🎯 [Benchmarks](../benchmark.md)
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)
accelerate_config/fsdp2.yaml
ADDED
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: gpu
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
distill.py
ADDED
```python
# Generated by Apertus on Public AI
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import DistilBertForSequenceClassification

from smol import DistillationTrainer  # `smol` as declared in pyproject.toml

# Step 1: Load the large model (teacher model) and a tokenizer.
# NOTE: logit distillation only makes sense when teacher and student share a
# tokenizer/vocabulary and output space; the model IDs below are kept from the
# original draft and do not satisfy that requirement.
teacher_model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-8B-Instruct-2509")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")

# Step 2: Choose the smaller model (student model)
# Here, we use DistilBERT as an example
student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")


# Define the distillation loss function (temperature-scaled KL divergence)
class DistillationLoss(nn.Module):
    def __init__(self, temperature, alpha):
        super().__init__()
        self.kl_loss = nn.KLDivLoss(reduction="batchmean")
        self.temperature = temperature
        self.alpha = alpha

    def forward(self, student_logits, teacher_logits):
        t = self.temperature
        soft_student = (student_logits / t).log_softmax(-1)
        soft_teacher = (teacher_logits / t).softmax(-1)
        # Scale by t^2 to keep gradient magnitudes comparable across temperatures
        return self.kl_loss(soft_student, soft_teacher) * self.alpha * t * t


# Define a simple training step
def train_step(student_model, teacher_model, batch, optimizer, loss_fn, device):
    # Tokenize the raw text inputs (the "text" field name is illustrative)
    inputs = tokenizer(batch["text"], padding=True, truncation=True, return_tensors="pt").to(device)
    labels = batch["labels"].to(device)

    # Forward pass with the teacher model (no gradients needed)
    with torch.no_grad():
        teacher_logits = teacher_model(**inputs).logits

    # Forward pass with the student model
    student_logits = student_model(**inputs).logits

    # Compute distillation loss
    distillation_loss = loss_fn(student_logits, teacher_logits)

    # Compute task loss (cross-entropy for classification)
    task_loss = F.cross_entropy(student_logits, labels)
    total_loss = distillation_loss + task_loss  # Combine both losses

    # Backward and optimize
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return total_loss.item(), student_logits, teacher_logits


# Initialize smol's DistillationTrainer (keyword arguments as in the original
# draft; check the smol documentation for the current signature)
trainer = DistillationTrainer(
    student_model,
    optimizer=AdamW(student_model.parameters(), lr=1e-5),  # Example learning rate
    loss_fn=DistillationLoss(temperature=1.0, alpha=0.5),  # Example distillation loss
    train_dataset=your_train_dataset,  # Your training dataset
    eval_dataset=your_eval_dataset,  # Your evaluation dataset
    device="cuda" if torch.cuda.is_available() else "cpu",  # Use GPU if available
    num_epochs=5,  # Number of epochs
    batch_size=16,  # Batch size
    log_dir="distillation_logs",  # Log directory
)

# Train the model
trainer.train()

# Alternatively, you can use smol's simplified training loop (as of smol 0.3.0, check the latest docs)
# trainer.train(steps=1000, evaluate_every=100, ...)
```
hello.py
ADDED
```python
def main():
    print("Hello from distill!")


if __name__ == "__main__":
    main()
```
main.py
ADDED
```python
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from dataclasses import dataclass

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import datasets
import torch
import torch.distributed
import transformers
from accelerate.logging import get_logger
from transformers import AutoTokenizer
from trl import SFTTrainer

import modelopt.torch.opt as mto
from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss

logger = get_logger(__name__, log_level="INFO")


@dataclass
class ModelArguments:
    teacher_name_or_path: str | None = None
    student_name_or_path: str | None = None


@dataclass
class TrainingArguments(transformers.TrainingArguments):
    do_train: bool = True
    do_eval: bool = True
    save_strategy: str = "no"
    max_length: int = 1024
    optim: str = "adamw_torch"
    learning_rate: float = 1e-5
    lr_scheduler_type: str = "cosine"
    dataloader_drop_last: bool = True
    dataset_num_proc: int = 8
    bf16: bool = True
    # tf32: bool = True


def _format_smoltalk_chat_template(sample, tokenizer):
    # The smol-smoltalk-Interaction-SFT dataset has "query" and "answer" fields.
    # Convert them to messages format and use the tokenizer's apply_chat_template.
    messages = [
        {"role": "user", "content": sample["query"]},
        {"role": "assistant", "content": sample["answer"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)


class KDSFTTrainer(KDTrainer, SFTTrainer):
    pass


def train():
    parser = transformers.HfArgumentParser((ModelArguments, TrainingArguments))
    model_args, training_args = parser.parse_args_into_dataclasses()

    # Enable automatic save/load of the modelopt state with HuggingFace checkpointing.
    # The modelopt state will be saved automatically to "modelopt_state.pth".
    mto.enable_huggingface_checkpointing()

    # Set the total batch size across all ranks to equal 64
    total_batch_size = 64
    num_accum_steps = total_batch_size / (
        training_args.per_device_train_batch_size * torch.distributed.get_world_size()
    )
    if not num_accum_steps.is_integer():
        raise ValueError(
            f"`per_device_train_batch_size` * `world_size` must be a factor of {total_batch_size}"
        )
    training_args.gradient_accumulation_steps = int(num_accum_steps)
    logger.info(
        f"Using {int(num_accum_steps)} grad accumulation steps for effective batchsize of {total_batch_size}."
    )

    # Dataset
    logger.info("Loading dataset...")
    dset = datasets.load_dataset("ReactiveAI/smol-smoltalk-Interaction-SFT", split="train")
    dset_splits = dset.train_test_split(train_size=12800, test_size=1280, seed=420)
    dset_train, dset_eval = dset_splits["train"], dset_splits["test"]
    logger.info("Dataset loaded.")

    # Tokenizer
    logger.info("Loading tokenizer...")
    model_path = model_args.teacher_name_or_path or model_args.student_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    logger.info("Tokenizer loaded.")

    # Model(s)
    logger.info("Loading student model...")
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.student_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
    )
    logger.info("Student loaded.")
    logger.info("Loading teacher model...")
    teacher_model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.teacher_name_or_path, dtype=torch.bfloat16 if training_args.bf16 else None
    )

    # Distillation configuration
    kd_config = {
        "teacher_model": teacher_model,
        "criterion": LMLogitsLoss(),
    }

    # Fix problematic settings that cause excessive warnings
    model.generation_config.temperature = None
    model.generation_config.top_p = None

    # Trainer
    trainer = KDSFTTrainer(
        model,
        training_args,
        distill_config=kd_config,
        train_dataset=dset_train,
        eval_dataset=dset_eval,
        formatting_func=lambda sample: _format_smoltalk_chat_template(sample, tokenizer),
        processing_class=tokenizer,
    )

    # Do training
    if training_args.do_train:
        logger.info("Beginning training...")
        trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
        logger.info("Training done.")

    # Do evaluation
    if training_args.do_eval:
        logger.info("Evaluating...")
        eval_results = trainer.evaluate()
        logger.info(eval_results)
        logger.info("Evaluation complete.")

    # Save checkpoint
    logger.info("Saving checkpoint...")
    trainer.save_state()
    trainer.save_model(trainer.args.output_dir)
    logger.info("Checkpoint saved.")


if __name__ == "__main__":
    train()
```
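The gradient-accumulation logic in `train()` can be verified with simple arithmetic. This standalone sketch plugs in the values from the README's example launch command (per-device batch size 4) and an assumed world size of 8 GPUs:

```python
# Mirror of main.py's accumulation-step derivation:
# total_batch_size = per_device_batch * world_size * grad_accum_steps
total_batch_size = 64
per_device_train_batch_size = 4
world_size = 8  # assumed number of GPUs, as in the README example

num_accum_steps = total_batch_size / (per_device_train_batch_size * world_size)
assert num_accum_steps.is_integer(), "per-device batch * world size must divide 64"
print(int(num_accum_steps))  # 2
```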
pyproject.toml
ADDED
```toml
[project]
name = "distill"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "smol>=0.5.7",
]
```
requirements.txt
ADDED
```text
pyarrow
torchao>=0.14.1
transformers<5.0
trl>=0.23.0
```
run.sh
ADDED
```bash
uv run accelerate launch --config-file ./accelerate_config/fsdp2.yaml \
    main.py \
    --teacher_name_or_path 'swiss-ai/Apertus-8B-Instruct-2509' \
    --student_name_or_path 'HuggingFaceTB/SmolLM2-135M-Instruct' \
    --output_dir ./Apertus-8B-distill \
    --max_length 2048 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 8 \
    --max_steps 200 \
    --logging_steps 5
```
uv.lock
ADDED
The diff for this file is too large to render.