---
language:
- en
tags:
- optipfair
- rearchitecting-llms
- depth-pruning
- model-optimization
- small-language-model
- Qwen-3.5
- educational
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
metrics:
- perplexity
- accuracy
datasets:
- HuggingFaceTB/cosmopedia
---

# Qwen3.5-0.65B-Base-Rearchitected

## Model Description

This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**, created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.

* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Framework:** [OptiPFair](https://github.com/peremartra/optipfair)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery

[![linkedin-profile-banner-martra](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/sa4ivCbm8kk6C9NAPmb-x.jpeg)](https://hubs.la/Q040tvsK0)

---

## Performance & Retention Metrics

The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.

### Retention Summary (vs Teacher Baseline)

| Metric | Value | Description |
|:---|:---|:---|
| **PPL Retention** | 109.62% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| **Capabilities Retention** | 89.21% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| **Overall Retention** | 92.11% | Combined health score across PPL and capabilities retention |
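
Both component scores follow directly from the raw numbers reported on this card; a quick check (small deviations from the table are expected because the displayed inputs are rounded):

```python
# Reproducing the component scores from the raw metrics on this card.
# Expect small deviations: the inputs shown on the card are rounded.
teacher_ppl, student_ppl = 7.34, 6.70    # see Linguistic Quality below
teacher_avg, student_avg = 60.8, 54.3    # benchmark table averages (%)

ppl_retention = teacher_ppl / student_ppl * 100         # ~109.55
capability_retention = student_avg / teacher_avg * 100  # ~89.31
print(f"PPL retention:        {ppl_retention:.2f}%")
print(f"Capability retention: {capability_retention:.2f}%")
```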

### Capability Benchmarks (LM Evaluation Harness)

**Recovery** = how much of the pruning degradation was recovered through distillation (compare *Student (After KD)* against *Pruned (No KD)*, relative to the Teacher).

| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
|:---|:---:|:---:|:---:|
| **Arc Easy** | 67.5% | 56.3% | 60.7% |
| **Winogrande** | 59.4% | 55.5% | 55.9% |
| **Hellaswag** | 54.9% | 44.0% | 47.2% |
| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
| **Piqa** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |
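
These scores come from EleutherAI's LM Evaluation Harness. A sketch of an equivalent run via its Python API (the exact harness version and settings behind the card's numbers are not recorded here, so treat this as an approximation):

```python
# Sketch: re-running the benchmarks with EleutherAI's harness
# (pip install lm-eval). Scores may differ slightly from the card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=oopere/Qwen3.5-0.65B-Base-Rearchitected",
    tasks=["arc_easy", "winogrande", "hellaswag", "lambada_openai", "piqa"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```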


![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/FlaxH7EQBiFOBdk-fEpSN.png)

### Linguistic Quality

* **Final Perplexity (PPL):** 6.70
* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29

> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.
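
Perplexity here is the exponential of the mean token-level cross-entropy on held-out text. A minimal sketch of the computation (the evaluation corpus and windowing behind the card's numbers are not reproduced here):

```python
# Minimal perplexity sketch: exp(mean NLL) over a piece of text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Paris is the capital of France and one of Europe's largest cities."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # With labels=input_ids, HF shifts internally and returns mean NLL.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"PPL: {torch.exp(loss).item():.2f}")
```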


![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/2CDSSYlVJib7nHW84PyIY.png)


---

## Architecture Details

* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
* **Student Model:** 666,171,584 parameters (after pruning)
* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22]); see the conceptual sketch after this list
* **Parameter Reduction:** 11.46%
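
The actual surgery was performed with OptiPFair (see the repo for its real API); as a rough illustration of what depth pruning does, the sketch below drops the selected decoder blocks by hand, assuming a Qwen/Llama-style module layout:

```python
# Conceptual depth-pruning sketch (illustration only; this card's model
# was produced with OptiPFair). Assumes decoder blocks live in
# model.model.layers, as in Qwen/Llama-style architectures.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-0.8B-Base")

drop = {9, 20, 21, 22}  # indices chosen by layer-importance analysis
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in drop
)
model.config.num_hidden_layers = len(model.model.layers)
# Note: modules that cache their own layer index (e.g. the attention
# layer_idx used by the KV cache) may need re-syncing before generation.
```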

---

## Training Procedure

### Dataset
* **Source:** [Cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
* **Samples:** 40,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples; see the loading sketch after this list)
* **Train/Val Split:** 80% / 20%
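
A hypothetical sketch of assembling this balanced split with the `datasets` library (the config names mirror the subset list above and may differ from the dataset's actual configs, e.g. Cosmopedia publishes `web_samples_v1`/`web_samples_v2`):

```python
# Hypothetical assembly of the 40k balanced distillation set. Verify
# the config names against the dataset's actual configs before use.
from datasets import concatenate_datasets, load_dataset

subsets = ["stories", "wikihow", "openstax", "web_samples_v1"]
parts = [
    load_dataset("HuggingFaceTB/cosmopedia", name, split="train[:10000]")
    for name in subsets  # 4 x 10,000 = 40,000 samples
]
data = concatenate_datasets(parts).train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = data["train"], data["test"]
```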

### Hyperparameters
* **Epochs:** 1
* **Batch Size:** 12 (effective: 48 with gradient accumulation)
* **Learning Rate:** 4e-05
* **Loss Function:** `α·CrossEntropy + β·Skew-KLD` (see the sketch after this list)
  * Task Loss Weight (α): 0.5
  * Logits Loss Weight (β): 0.5
  * Skew Interpolation Factor: 0.0
  * Temperature: 2.0
* **Optimizer:** AdamW
* **Gradient Clipping:** 1.0
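
The loss combines a hard-label task term with a skewed KL term. A minimal sketch, assuming the common skew form SKL(p‖q) = KL(p ‖ λ·p + (1−λ)·q); with λ = 0.0, as configured here, it reduces to the standard forward KL, and the book's exact implementation may differ:

```python
# Sketch of alpha*CE + beta*Skew-KLD (assumed form; not the book's code).
# With lam = 0.0 the skewed target is just the student distribution,
# i.e. standard forward KL(teacher || student).
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, beta=0.5, lam=0.0, T=2.0):
    # Hard-label task loss on the student's raw logits.
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
    )
    # Temperature-softened distributions.
    p = F.softmax(teacher_logits / T, dim=-1)   # teacher
    q = F.softmax(student_logits / T, dim=-1)   # student
    mix = lam * p + (1.0 - lam) * q             # skewed target
    # KL(p || mix), averaged over positions; T^2 keeps gradient scale.
    kld = (p * (p.clamp_min(1e-9).log()
                - mix.clamp_min(1e-9).log())).sum(-1).mean()
    return alpha * ce + beta * (T ** 2) * kld
```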

### Hardware & Training Time
* **GPU:** NVIDIA A100-SXM4-80GB
* **Training Time:** 4011.1s (66.85 minutes)
* **Avg Time per Epoch:** 4011.1s

---

## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Limitations & Intended Use

### Intended Use
This is an **educational model** created as part of the **Hands-on Lab in Chapter 6** of "Rearchitecting LLMs". It demonstrates:
- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through labels-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate

**Not intended for production use.** This model serves as a learning artifact and baseline for readers to improve upon.

### Limitations
- **Training Data:** General-purpose Cosmopedia corpus (not domain-specialized)
- **Knowledge Coverage:** Reduced compared to full-scale models due to structural pruning
- **Capabilities:** Best suited for simple completion tasks; complex reasoning may be degraded
- **Language:** English only

---

## Citation

If you use this model or the techniques described in your research or projects, please cite:

### Book
```bibtex
@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}
```

### Framework
```bibtex
@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}
```

---

## Acknowledgments

This model was created following the methodologies taught in **"Rearchitecting LLMs"** (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.

**Challenge for readers:** Can you improve the retention metrics beyond 92.1%? Try adjusting:
- Layer selection strategy (use cosine similarity analysis)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate

Share your results in the [book's discussion forum](https://hubs.la/Q040tvtp0)!