---
language:
- en
tags:
- optipfair
- rearchitecting-llms
- depth-pruning
- model-optimization
- small-language-model
- Qwen-3.5
- educational
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
metrics:
- perplexity
- accuracy
datasets:
- HuggingFaceTB/cosmopedia
---

# Qwen3.5-0.65B-Base-Rearchitected

## Model Description

This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**, created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.

* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Framework:** [OptiPFair](https://github.com/peremartra/optipfair)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery

[![linkedin-profile-banner-martra](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/sa4ivCbm8kk6C9NAPmb-x.jpeg)](https://hubs.la/Q040tvsK0)

---

## Performance & Retention Metrics

The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.

### Retention Summary (vs Teacher Baseline)

| Metric | Value | Description |
|:---|:---|:---|
| **PPL Retention** | 109.62% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| **Capabilities Retention** | 89.21% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| **Overall Retention** | 92.11% | Combined health score across PPL and capabilities retention |
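
Both component scores follow directly from the raw numbers reported on this card; a quick check (small deviations from the table are expected because the displayed inputs are rounded):

```python
# Reproducing the component scores from the raw metrics on this card.
# Expect small deviations: the inputs shown on the card are rounded.
teacher_ppl, student_ppl = 7.34, 6.70    # see Linguistic Quality below
teacher_avg, student_avg = 60.8, 54.3    # benchmark table averages (%)

ppl_retention = teacher_ppl / student_ppl * 100         # ~109.55
capability_retention = student_avg / teacher_avg * 100  # ~89.31
print(f"PPL retention:        {ppl_retention:.2f}%")
print(f"Capability retention: {capability_retention:.2f}%")
```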

### Capability Benchmarks (LM Evaluation Harness)

**Recovery** = how much of the pruning degradation was recovered through distillation (compare *Student (After KD)* against *Pruned (No KD)*, relative to the Teacher).

| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
|:---|:---:|:---:|:---:|
| **Arc Easy** | 67.5% | 56.3% | 60.7% |
| **Winogrande** | 59.4% | 55.5% | 55.9% |
| **Hellaswag** | 54.9% | 44.0% | 47.2% |
| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
| **Piqa** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |
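
These scores come from EleutherAI's LM Evaluation Harness. A sketch of an equivalent run via its Python API (the exact harness version and settings behind the card's numbers are not recorded here, so treat this as an approximation):

```python
# Sketch: re-running the benchmarks with EleutherAI's harness
# (pip install lm-eval). Scores may differ slightly from the card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=oopere/Qwen3.5-0.65B-Base-Rearchitected",
    tasks=["arc_easy", "winogrande", "hellaswag", "lambada_openai", "piqa"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```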


![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/FlaxH7EQBiFOBdk-fEpSN.png)

### Linguistic Quality

* **Final Perplexity (PPL):** 6.70
* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29

> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.
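
Perplexity here is the exponential of the mean token-level cross-entropy on held-out text. A minimal sketch of the computation (the evaluation corpus and windowing behind the card's numbers are not reproduced here):

```python
# Minimal perplexity sketch: exp(mean NLL) over a piece of text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Paris is the capital of France and one of Europe's largest cities."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels=input_ids, HF shifts internally and returns mean NLL.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"PPL: {torch.exp(loss).item():.2f}")
```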


![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/2CDSSYlVJib7nHW84PyIY.png)


---

## Architecture Details

* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
* **Student Model:** 666,171,584 parameters (after pruning)
* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22]); see the conceptual sketch after this list
* **Parameter Reduction:** 11.46%
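
The actual surgery was performed with OptiPFair (see the repo for its real API); as a rough illustration of what depth pruning does, the sketch below drops the selected decoder blocks by hand, assuming a Qwen/Llama-style module layout:

```python
# Conceptual depth-pruning sketch (illustration only; this card's model
# was produced with OptiPFair). Assumes decoder blocks live in
# model.model.layers, as in Qwen/Llama-style architectures.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B-Base")

drop = {9, 20, 21, 22}  # indices chosen by layer-importance analysis
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in drop
)
model.config.num_hidden_layers = len(model.model.layers)
# Note: modules that cache their own layer index (e.g. the attention
# layer_idx used by the KV cache) may need re-syncing before generation.
```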

---

## Training Procedure

### Dataset
* **Source:** [Cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
* **Samples:** 40,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples; see the loading sketch after this list)
* **Train/Val Split:** 80% / 20%
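
A hypothetical sketch of assembling this balanced split with the `datasets` library (the config names mirror the subset list above and may differ from the dataset's actual configs, e.g. Cosmopedia publishes `web_samples_v1`/`web_samples_v2`):

```python
# Hypothetical assembly of the 40k balanced distillation set. Verify
# the config names against the dataset's actual configs before use.
from datasets import concatenate_datasets, load_dataset

subsets = ["stories", "wikihow", "openstax", "web_samples_v1"]
parts = [
    load_dataset("HuggingFaceTB/cosmopedia", name, split="train[:10000]")
    for name in subsets  # 4 x 10,000 = 40,000 samples
]
data = concatenate_datasets(parts).train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = data["train"], data["test"]
```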

### Hyperparameters
* **Epochs:** 1
* **Batch Size:** 12 (effective: 48 with gradient accumulation)
* **Learning Rate:** 4e-05
* **Loss Function:** `α·CrossEntropy + β·Skew-KLD` (see the sketch after this list)
  * Task Loss Weight (α): 0.5
  * Logits Loss Weight (β): 0.5
  * Skew Interpolation Factor: 0.0
  * Temperature: 2.0
* **Optimizer:** AdamW
* **Gradient Clipping:** 1.0
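
The loss combines a hard-label task term with a skewed KL term. A minimal sketch, assuming the common skew form SKL(p‖q) = KL(p ‖ λ·p + (1−λ)·q); with λ = 0.0, as configured here, it reduces to the standard forward KL, and the book's exact implementation may differ:

```python
# Sketch of alpha*CE + beta*Skew-KLD (assumed form; not the book's code).
# With lam = 0.0 the skewed target is just the student distribution,
# i.e. standard forward KL(teacher || student).
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, beta=0.5, lam=0.0, T=2.0):
    # Hard-label task loss on the student's raw logits.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
    )
    # Temperature-softened distributions.
    p = F.softmax(teacher_logits / T, dim=-1)   # teacher
    q = F.softmax(student_logits / T, dim=-1)   # student
    mix = lam * p + (1.0 - lam) * q             # skewed target
    # KL(p || mix), averaged over positions; T^2 keeps gradient scale.
    kld = (p * (p.clamp_min(1e-9).log()
                - mix.clamp_min(1e-9).log())).sum(-1).mean()
    return alpha * ce + beta * (T ** 2) * kld
```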

### Hardware & Training Time
* **GPU:** NVIDIA A100-SXM4-80GB
* **Training Time:** 4011.1s (66.85 minutes)
* **Avg Time per Epoch:** 4011.1s

---

## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Limitations & Intended Use

### Intended Use
This is an **educational model** created as part of the **Hands-on Lab in Chapter 6** of "Rearchitecting LLMs". It demonstrates:
- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through labels-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate

**Not intended for production use.** This model serves as a learning artifact and baseline for readers to improve upon.

### Limitations
- **Training Data:** General-purpose Cosmopedia corpus (not domain-specialized)
- **Knowledge Coverage:** Reduced compared to full-scale models due to structural pruning
- **Capabilities:** Best suited for simple completion tasks; complex reasoning may be degraded
- **Language:** English only

---

## Citation

If you use this model or the techniques described in your research or projects, please cite:

### Book
```bibtex
@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}
```

### Framework
```bibtex
@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}
```

---

## Acknowledgments

This model was created following the methodologies taught in **"Rearchitecting LLMs"** (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.

**Challenge for readers:** Can you improve the retention metrics beyond 92.1%? Try adjusting:
- Layer selection strategy (use cosine similarity analysis)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate

Share your results in the [book's discussion forum](https://hubs.la/Q040tvtp0)!