---
language:
- en
tags:
- optipfair
- rearchitecting-llms
- depth-pruning
- model-optimization
- small-language-model
- Qwen-3.5
- educational
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
metrics:
- perplexity
- accuracy
datasets:
- HuggingFaceTB/cosmopedia
---
# Qwen3.5-0.65B-Base-Rearchitected
## Model Description
This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**,
created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.
* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Framework:** [OptiPFair](https://github.com/peremartra/optipfair)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery
[![linkedin-profile-banner-martra](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/sa4ivCbm8kk6C9NAPmb-x.jpeg)](https://hubs.la/Q040tvsK0)
---
## Performance & Retention Metrics
The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.
### Retention Summary (vs Teacher Baseline)
| Metric | Value | Description |
|:---|:---|:---|
| **PPL Retention** | 109.62% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| **Capabilities Retention** | 89.21% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| **Overall Retention** | 92.11% | Combined health score across PPL and capabilities retention |
### Capability Benchmarks (LM Evaluation Harness)
**Recovery** = How much of the pruning degradation was recovered through distillation.
| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
|:---|:---:|:---:|:---:|
| **ARC Easy** | 67.5% | 56.3% | 60.7% |
| **WinoGrande** | 59.4% | 55.5% | 55.9% |
| **HellaSwag** | 54.9% | 44.0% | 47.2% |
| **LAMBADA (OpenAI)** | 50.9% | 8.4% | 39.9% |
| **PIQA** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |
![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/FlaxH7EQBiFOBdk-fEpSN.png)
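The recovery figure for each benchmark follows directly from the table's numbers. A small sketch (values copied from this card):

```python
# Recovery = share of the pruning-induced accuracy drop regained by
# distillation, computed per benchmark from the table above (values in %).
benchmarks = {
    # name: (teacher, pruned_no_kd, student_after_kd)
    "arc_easy":   (67.5, 56.3, 60.7),
    "winogrande": (59.4, 55.5, 55.9),
    "hellaswag":  (54.9, 44.0, 47.2),
    "lambada":    (50.9,  8.4, 39.9),
    "piqa":       (71.5, 63.6, 67.7),
}

recovery = {
    name: (student - pruned) / (teacher - pruned) * 100
    for name, (teacher, pruned, student) in benchmarks.items()
}
for name, r in recovery.items():
    print(f"{name:11s} recovery: {r:5.1f}%")
```

Note how LAMBADA, the benchmark hit hardest by pruning, is also where distillation recovers the most ground (roughly three quarters of the lost accuracy).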
### Linguistic Quality
* **Final Perplexity (PPL):** 6.70
* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29
> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.
![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/2CDSSYlVJib7nHW84PyIY.png)
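For intuition on the PPL numbers above: perplexity is the exponential of the mean negative log-likelihood the model assigns to each next token. A self-contained toy example (the per-token probabilities are made up for illustration, not from this model):

```python
import math

# Perplexity = exp(mean negative log-likelihood) over next-token predictions.
# Hypothetical probabilities the model assigned to the correct next tokens:
token_probs = [0.20, 0.05, 0.40, 0.10, 0.25]

nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(f"perplexity: {ppl:.2f}")
```

Lower is better: a model that assigned probability 1.0 to every correct token would score PPL = 1.0.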
---
## Architecture Details
* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
* **Student Model:** Pruned to 666,171,584 parameters
* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22])
* **Parameter Reduction:** 11.46%
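Depth pruning of this kind amounts to dropping whole decoder blocks by index and reindexing the remaining stack. A toy sketch (the block count here is illustrative, not the teacher's actual depth; the removed indices are the ones reported above):

```python
# Toy stand-in for a decoder stack: depth pruning removes whole blocks
# by index, keeping the rest of the stack in order.
layers = [f"block_{i}" for i in range(28)]   # illustrative depth
to_remove = {9, 20, 21, 22}                  # indices from this card

pruned = [blk for i, blk in enumerate(layers) if i not in to_remove]
print(len(layers), "->", len(pruned))
```

In a real run the same idea is applied to the model's `nn.ModuleList` of transformer blocks, after which `num_hidden_layers` in the config must be updated to match.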
---
## Training Procedure
### Dataset
* **Source:** [Cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
* **Samples:** 40,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples)
* **Train/Val Split:** 80% / 20%
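The balanced sampling and 80/20 split above can be sketched as follows (a minimal illustration; the subset labels mirror this card, and the seed and shuffling scheme are assumptions, not the exact pipeline):

```python
import random

# Balanced sampling across 4 Cosmopedia subsets, then an 80/20 split.
subsets = ["stories", "wikihow", "openstax", "web_samples"]
per_subset = 40_000 // len(subsets)          # 10,000 samples per subset

samples = [(name, i) for name in subsets for i in range(per_subset)]
random.Random(42).shuffle(samples)           # seed is illustrative

cut = int(0.8 * len(samples))
train, val = samples[:cut], samples[cut:]
print(len(train), len(val))
```

Balancing the subsets before splitting keeps each style (stories, how-tos, textbook prose, web text) represented in both the train and validation sets.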
### Hyperparameters
* **Epochs:** 1
* **Batch Size:** 12 (effective: 48 with gradient accumulation)
* **Learning Rate:** 4e-05
* **Loss Function:** `α·CrossEntropy + β·Skew-KLD`
* Task Loss Weight (α): 0.5
* Logits Loss Weight (β): 0.5
* Skew Interpolation Factor: 0.0
* Temperature: 2.0
* **Optimizer:** AdamW
* **Gradient Clipping:** 1.0
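A minimal PyTorch sketch of this loss, assuming a generic skew-KL formulation, KL(p_teacher ‖ skew·p_teacher + (1 − skew)·p_student); with the skew factor of 0.0 used here it reduces to a plain forward KL against the temperature-softened teacher. The `distill_loss` helper and tensor shapes are illustrative, not OptiPFair's actual API:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, beta=0.5, skew=0.0, T=2.0):
    """alpha * CrossEntropy(labels) + beta * Skew-KLD(teacher || mix).

    The skew-KL reference mixes teacher and student distributions:
    KL(p_t || skew * p_t + (1 - skew) * p_s). With skew = 0.0 (this
    card's setting) it is an ordinary forward KL divergence.
    """
    ce = F.cross_entropy(student_logits, labels)

    p_t = F.softmax(teacher_logits / T, dim=-1)
    p_s = F.softmax(student_logits / T, dim=-1)
    mix = skew * p_t + (1.0 - skew) * p_s
    # kl_div expects log-probs as input and probs as target.
    kld = F.kl_div(mix.log(), p_t, reduction="batchmean") * (T ** 2)

    return alpha * ce + beta * kld

# Toy shapes: (batch, vocab)
torch.manual_seed(0)
s = torch.randn(4, 32)
t = torch.randn(4, 32)
y = torch.randint(0, 32, (4,))
print(distill_loss(s, t, y).item())
```

The `T ** 2` factor is the standard correction that keeps the KL term's gradient magnitude comparable across temperatures.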
### Hardware & Training Time
* **GPU:** NVIDIA A100-SXM4-80GB
* **Training Time:** 4011.1s (66.85 minutes)
* **Avg Time per Epoch:** 4011.1s
---
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## Limitations & Intended Use
### Intended Use
This is an **educational model** created as part of the **Hands-on Lab in Chapter 6** of "Rearchitecting LLMs". It demonstrates:
- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through labels-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate
**Not intended for production use.** This model serves as a learning artifact and baseline for readers to improve upon.
### Limitations
- **Training Data:** General-purpose Cosmopedia corpus (not domain-specialized)
- **Knowledge Coverage:** Reduced compared to full-scale models due to structural pruning
- **Capabilities:** Best suited for simple completion tasks; complex reasoning may be degraded
- **Language:** English only
---
## Citation
If you use this model or the techniques described in your research or projects, please cite:
### Book
```bibtex
@book{martra2026rearchitecting,
author = {Pere Martra},
title = {Rearchitecting LLMs: Structural techniques for efficient models},
publisher = {Manning Publications},
year = {2026},
url = {https://hubs.la/Q040tvtp0}
}
```
### Framework
```bibtex
@software{optipfair2024,
author = {Pere Martra},
title = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
year = {2024},
url = {https://github.com/peremartra/optipfair}
}
```
---
## Acknowledgments
This model was created following the methodologies taught in **"Rearchitecting LLMs"** (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.
**Challenge for readers:** Can you improve the retention metrics beyond 92.1%? Try adjusting:
- Layer selection strategy (use cosine similarity analysis)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate
Share your results in the [book's discussion forum](https://hubs.la/Q040tvtp0)!