oopere committed · Commit 73d6b3e · verified · 1 Parent(s): 41420d0

add charts

Files changed (1):
  1. README.md +23 -13

README.md CHANGED
@@ -22,7 +22,7 @@ datasets:
 
 ## Model Description
 
-This model is a surgically optimized and distilled version of **Qwen3.5-0.5B-Base-Rearchitected**,
 created with the techniques covered in **Chapter 6** in the book **"Rearchitecting LLMs"**.
 
 * **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
@@ -30,12 +30,13 @@ created with the techniques covered in **Chapter 6** in the book **"Rearchitecti
 * **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
 * **Chapter:** Chapter 6 - Knowledge Recovery
 
 ---
 
 ## Performance & Retention Metrics
 
-The goal of this optimization was to maximize parameter efficiency while maintaining the highest possible retention of the Teacher's capabilities.
-
 ### Retention Summary (vs Teacher Baseline)
 
 | Metric | Value | Description |
@@ -48,14 +49,17 @@ The goal of this optimization was to maximize parameter efficiency while maintai
 
 **Recovery** = How much of the pruning degradation was recovered through distillation.
 
-| Benchmark | Teacher | Pruned (No KD) | Student (After KD) | Recovery |
-|:---|:---:|:---:|:---:|:---:|
-| **Arc Easy** | 67.5% | 56.3% | 60.7% | 39.8% |
-| **Winogrande** | 59.4% | 55.5% | 55.9% | 9.9% |
-| **Hellaswag** | 54.9% | 44.0% | 47.2% | 29.6% |
-| **Lambada Openai** | 50.9% | 8.4% | 39.9% | 74.1% |
-| **Piqa** | 71.5% | 63.6% | 67.7% | 51.3% |
-| **Average** | 60.8% | 45.5% | 54.3% | 57.1% |
 
 ### Linguistic Quality
 
@@ -63,13 +67,19 @@ The goal of this optimization was to maximize parameter efficiency while maintai
 * **Teacher Baseline PPL:** 7.34
 * **Pruned (No KD) PPL:** 24.29
 
 ---
 
 ## Architecture Details
 
-* **Teacher Model:** `Qwen3.5-0.5B-Base-Rearchitected` (752,393,024 parameters)
 * **Student Model:** Pruned to (666,171,584 parameters)
-* **Layers Removed:** 4 layers (indices: [21, 20, 9, 22])
 * **Parameter Reduction:** 11.46%
 
 ---
 
 
 ## Model Description
 
+This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**,
 created with the techniques covered in **Chapter 6** in the book **"Rearchitecting LLMs"**.
 
 * **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
 
 * **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
 * **Chapter:** Chapter 6 - Knowledge Recovery
 
+[![linkedin-profile-banner-martra](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/sa4ivCbm8kk6C9NAPmb-x.jpeg)](https://hubs.la/Q040tvsK0)
+
 ---
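The Technique bullet above names a Skew KL Divergence distillation loss but the card shows no code. A minimal sketch of one common formulation from the distillation literature, in which the reference distribution is skewed by mixing a fraction α of the teacher's own distribution into the student's before taking KL — the function names, toy distributions, and α value are illustrative assumptions, not the book's implementation; "labels-only" is taken to mean the loss is accumulated only over response/label token positions.

```python
import math

def kl(p, q):
    """Standard KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_kl(p, q, alpha=0.1):
    """Skewed KL: KL(p || alpha*p + (1-alpha)*q).

    Mixing a little of p into q keeps the divergence finite even at
    positions where the student q assigns (near-)zero probability,
    which stabilizes distillation early in training.
    """
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl(p, mix)

# Toy next-token distributions for a single label token; in labels-only
# KD this term would be summed over the response tokens only.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
loss = skew_kl(teacher, student, alpha=0.1)
```

The mixing coefficient α trades off between standard forward KL (α = 0) and a fully self-referential, always-zero loss (α = 1); small values are typical.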
 
 ## Performance & Retention Metrics
 
+The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.
 ### Retention Summary (vs Teacher Baseline)
 
 | Metric | Value | Description |
 
 
 **Recovery** = How much of the pruning degradation was recovered through distillation.
 
+| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
+|:---|:---:|:---:|:---:|
+| **Arc Easy** | 67.5% | 56.3% | 60.7% |
+| **Winogrande** | 59.4% | 55.5% | 55.9% |
+| **Hellaswag** | 54.9% | 44.0% | 47.2% |
+| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
+| **Piqa** | 71.5% | 63.6% | 67.7% |
+| **Average** | 60.8% | 45.5% | 54.3% |
+
+![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/FlaxH7EQBiFOBdk-fEpSN.png)
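The Recovery percentages reported in the previous revision's table are consistent, up to rounding of the displayed scores, with defining recovery as the fraction of pruning-induced accuracy loss that distillation won back. A small sketch; the formula is an assumption inferred from the numbers rather than taken from the card:

```python
def recovery(teacher, pruned, student):
    """Percentage of the accuracy lost to pruning that distillation
    recovered: (student - pruned) / (teacher - pruned) * 100."""
    return 100 * (student - pruned) / (teacher - pruned)

# Lambada OpenAI row from the table above:
# (39.9 - 8.4) / (50.9 - 8.4) = 31.5 / 42.5, roughly 74.1
r = recovery(teacher=50.9, pruned=8.4, student=39.9)
```

Applied to the other rows it reproduces the earlier Recovery column to within a few tenths of a point, the residual being rounding in the displayed percentages.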
 
 ### Linguistic Quality
 
 * **Teacher Baseline PPL:** 7.34
 * **Pruned (No KD) PPL:** 24.29
 
+> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.
+
+![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/2CDSSYlVJib7nHW84PyIY.png)
+
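The PPL figures above are perplexities: the exponential of the mean per-token negative log-likelihood over an evaluation corpus. A minimal sketch with toy NLL values; real values come from the model's log-probabilities on held-out text:

```python
import math

def perplexity(token_nlls):
    """Corpus perplexity: exp of the mean per-token negative
    log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy per-token NLLs; a PPL of 7.34 corresponds to a mean NLL of ln(7.34).
nlls = [2.1, 1.8, 2.4, 1.9]
ppl = perplexity(nlls)
```

Because PPL is corpus-relative, a Student beating the Teacher's 7.34 on this corpus indicates specialization to the evaluation distribution, which is exactly the domain-adaptation reading given in the note.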
 ---
 
 ## Architecture Details
 
+* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
 * **Student Model:** Pruned to (666,171,584 parameters)
+* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22])
 * **Parameter Reduction:** 11.46%
 
 ---
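The depth pruning described above amounts to deleting whole Transformer blocks by index. A minimal sketch in plain Python, using the indices from the card; the 24-block stack is a hypothetical stand-in, and with a real Hugging Face checkpoint the kept blocks would typically be wrapped back into a `torch.nn.ModuleList` assigned to `model.model.layers` (an assumption about the model layout). The parameter-reduction arithmetic from the card is included as a check.

```python
def drop_blocks(blocks, indices):
    """Return the decoder blocks kept after depth pruning, i.e. all
    blocks whose position is not in `indices`."""
    drop = set(indices)
    return [b for i, b in enumerate(blocks) if i not in drop]

# Hypothetical 24-block decoder stack; the pruned indices are from the card.
blocks = [f"block_{i}" for i in range(24)]
kept = drop_blocks(blocks, [21, 20, 9, 22])  # 20 blocks remain

# Parameter-reduction figure from the card:
# (752,393,024 - 666,171,584) / 752,393,024, roughly 11.46%
reduction = 100 * (1 - 666_171_584 / 752_393_024)
```

Note that three of the four removed indices (20, 21, 22) sit near the top of the stack, consistent with the common observation that late-middle blocks are often the most redundant under depth pruning.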