Oleg Lavrovsky committed
Commit 8bb2e6f · unverified · 1 Parent(s): 7b45378

README links

Files changed (1)
  1. README.md +15 -0
README.md CHANGED
@@ -1,5 +1,20 @@
# Knowledge Distillation

+ Source of this doc: https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_distill/README.md
+ Additional links:
+
+ - https://arxiv.org/abs/2601.14051
+ - https://arxiv.org/abs/2402.12030
+ - https://huggingface.co/docs/transformers/v4.56.2/en/model_doc/apertus
+ - https://medium.com/@gsaidheeraj/swiss-ais-apertus-70b-and-8b-a-complete-deep-dive-into-switzerland-s-revolutionary-open-language-90a88b904f6b
+ - https://huggingface.co/unsloth/Apertus-8B-Instruct-2509-GGUF
+ - https://huggingface.co/daslab-testing/Apertus-1.7B-it360000-SFT/blob/main/README.md
+ - https://www.emergentmind.com/papers/2509.14233
+ - https://huggingface.co/mistralai/Mistral-Nemo-Base-2407
+ - https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Mistral/Supervised_fine_tuning_(SFT)_of_an_LLM_using_Hugging_Face_tooling.ipynb/
+
+ ---
+
Knowledge Distillation is a machine learning technique where a compact "student" model learns to replicate the behavior of a larger, more complex "teacher" model to achieve comparable performance with improved efficiency.

Model Optimizer's Distillation is a set of wrappers and utilities to easily perform Knowledge Distillation among teacher and student models. Given a pretrained teacher model, Distillation has the potential to train a smaller student model faster and/or with higher accuracy than the student model could achieve on its own.
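
The closing paragraphs describe distillation only in prose, so here is a minimal sketch of the underlying training step. It assumes a Hugging Face causal-LM teacher and student that share a tokenizer; the teacher name is borrowed from the link list above, the student path is a placeholder, and the loss is the standard softened-softmax KL formulation rather than Model Optimizer's own wrappers.

```python
# Minimal knowledge-distillation step: the student is optimized on a mix of its own
# cross-entropy loss and a KL-divergence term pulling its token distribution toward
# the frozen teacher's. Checkpoint names are illustrative, not from the source README.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_ID = "mistralai/Mistral-Nemo-Base-2407"  # larger teacher (from the link list)
STUDENT_ID = "path/to/smaller-student-model"     # placeholder: smaller model to be trained

tokenizer = AutoTokenizer.from_pretrained(TEACHER_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

teacher = AutoModelForCausalLM.from_pretrained(TEACHER_ID).eval()
student = AutoModelForCausalLM.from_pretrained(STUDENT_ID)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

TEMPERATURE = 2.0  # softens both distributions so small logit differences still carry signal
ALPHA = 0.5        # mixing weight between hard-label loss and distillation loss

def distill_step(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)

    with torch.no_grad():                          # teacher is frozen; only the student updates
        teacher_logits = teacher(**batch).logits
    student_out = student(**batch, labels=labels)  # forward pass + cross-entropy on labels

    # Hinton-style distillation: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so its gradients stay comparable to the cross-entropy term.
    # (Padding positions are included here for brevity.)
    kd_loss = F.kl_div(
        F.log_softmax(student_out.logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE**2

    loss = ALPHA * student_out.loss + (1 - ALPHA) * kd_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Model Optimizer's wrappers package roughly this pattern (frozen teacher forward, a logits-matching criterion, and loss balancing) behind its own conversion API; the linked llm_distill README documents the actual entry points.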