## Model Description

This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**, created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.

* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery

[](https://hubs.la/Q040tvsK0)
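The distillation objective named above, labels-only KD with a skew KL divergence, can be sketched per token position as follows. This is a minimal numeric sketch: the skew coefficient `alpha = 0.1` and the exact formulation are illustrative assumptions, not the book's implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def skew_kl(teacher_logits, student_logits, alpha=0.1):
    """Skew KL divergence KL(p || alpha*p + (1-alpha)*q) for one position.

    Mixing a fraction of the teacher distribution p into the student
    distribution q keeps the divergence finite even when q assigns
    near-zero probability to tokens the teacher favors.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / (alpha * pi + (1 - alpha) * qi))
               for pi, qi in zip(p, q) if pi > 0)
```

Skewing the reference toward the teacher bounds the loss by `log(1/alpha)`, which stabilizes early training when the pruned student's distribution is still far from the teacher's.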
---

## Performance & Retention Metrics

The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.

### Retention Summary (vs Teacher Baseline)

| Metric | Value | Description |
**Recovery** = How much of the pruning degradation was recovered through distillation.

| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
|:---|:---:|:---:|:---:|
| **Arc Easy** | 67.5% | 56.3% | 60.7% |
| **Winogrande** | 59.4% | 55.5% | 55.9% |
| **Hellaswag** | 54.9% | 44.0% | 47.2% |
| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
| **Piqa** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |
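Given the Average row of this table, the recovery fraction can be reproduced with simple arithmetic (the helper name `recovery` is illustrative):

```python
def recovery(teacher, pruned, distilled):
    """Fraction of the pruning-induced score drop that distillation won back."""
    return (distilled - pruned) / (teacher - pruned)

# Average scores from the table: teacher 60.8, pruned 45.5, after KD 54.3.
avg = recovery(60.8, 45.5, 54.3)  # ~0.575, i.e. roughly 57.5% of the drop recovered
```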


### Linguistic Quality

* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29

> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.
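The PPL figures above are exponentiated average negative log-likelihoods per token. A minimal sketch of the computation, on hypothetical per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 1/2 to every token has PPL of 2.
ppl = perplexity([math.log(0.5)] * 4)  # ~2.0
```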


---
## Architecture Details

* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
* **Student Model:** Pruned to 666,171,584 parameters
* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22])
* **Parameter Reduction:** 11.46%

---
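The depth pruning itself amounts to deleting whole transformer blocks from the model's layer stack. A minimal sketch; only the removed indices [21, 20, 9, 22] and the parameter counts come from this card, while the 28-block original depth is a hypothetical value for illustration:

```python
def prune_blocks(blocks, remove):
    """Depth pruning: delete whole transformer blocks by index.

    Deleting in descending index order keeps the positions of the
    remaining blocks stable while we remove.
    """
    blocks = list(blocks)
    for idx in sorted(remove, reverse=True):
        del blocks[idx]
    return blocks

# Indices removed for this model, per the card above.
REMOVED = [21, 20, 9, 22]
# A hypothetical 28-block stack, represented here by its indices:
kept = prune_blocks(range(28), REMOVED)

# The parameter counts above imply the stated 11.46% reduction:
reduction = 1 - 666_171_584 / 752_393_024  # ~0.1146
```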