---
language:
- en
tags:
- optipfair
- rearchitecting-llms
- depth-pruning
- model-optimization
- small-language-model
- Qwen-3.5
- educational
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
metrics:
- perplexity
- accuracy
datasets:
- HuggingFaceTB/cosmopedia
---

# Qwen3.5-0.65B-Base-Rearchitected
## Model Description

This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**, created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.

* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Framework:** [OptiPFair](https://github.com/peremartra/optipfair)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery

[](https://hubs.la/Q040tvsK0)

---

## Performance & Retention Metrics

The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.

### Retention Summary (vs Teacher Baseline)

| Metric | Value | Description |
|:---|:---|:---|
| **PPL Retention** | 109.62% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| **Capabilities Retention** | 89.21% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| **Overall Retention** | 92.11% | Combined health score from PPL and capabilities retention |
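Given the perplexity and benchmark figures reported below, the two component ratios can be reproduced directly. The card's own values were presumably computed from unrounded inputs, so the last decimals may differ slightly:

```python
# Reproduce the retention ratios from the figures reported in this card.
# The card's values (109.62%, 89.21%) were likely computed from unrounded
# inputs, so the last decimals differ slightly.

teacher_ppl, student_ppl = 7.34, 6.70   # perplexity: lower is better
teacher_avg, student_avg = 60.8, 54.3   # benchmark averages (%)

# PPL retention: Teacher PPL / Student PPL x 100
# (>100% means the student's perplexity improved on the eval corpus)
ppl_retention = teacher_ppl / student_ppl * 100

# Capabilities retention: Avg Student / Avg Teacher x 100
cap_retention = student_avg / teacher_avg * 100

print(f"PPL retention:          {ppl_retention:.2f}%")
print(f"Capabilities retention: {cap_retention:.2f}%")
```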

### Capability Benchmarks (LM Evaluation Harness)

**Recovery** = how much of the pruning degradation was recovered through distillation.

| Benchmark | Teacher | Pruned (No KD) | Student (After KD) |
|:---|:---:|:---:|:---:|
| **Arc Easy** | 67.5% | 56.3% | 60.7% |
| **Winogrande** | 59.4% | 55.5% | 55.9% |
| **Hellaswag** | 54.9% | 44.0% | 47.2% |
| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
| **Piqa** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |
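One standard way to quantify the Recovery defined above is the fraction of the pruning-induced drop that distillation won back, (Student − Pruned) / (Teacher − Pruned). The exact formula behind the card's numbers isn't stated, so treat this as an assumption; a sketch over the table's values:

```python
# Recovery sketch: fraction of the pruning-induced accuracy drop that
# distillation recovered, per benchmark. The formula is assumed, not
# taken from the card: (student - pruned) / (teacher - pruned) * 100.
scores = {
    # benchmark: (teacher, pruned_no_kd, student_after_kd), in %
    "Arc Easy":       (67.5, 56.3, 60.7),
    "Winogrande":     (59.4, 55.5, 55.9),
    "Hellaswag":      (54.9, 44.0, 47.2),
    "Lambada Openai": (50.9,  8.4, 39.9),
    "Piqa":           (71.5, 63.6, 67.7),
}

recovery = {
    name: (student - pruned) / (teacher - pruned) * 100
    for name, (teacher, pruned, student) in scores.items()
}

for name, rec in recovery.items():
    print(f"{name:15s} {rec:6.1f}% recovered")
```

Lambada Openai stands out: distillation recovered roughly three quarters of a 42.5-point drop.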

### Linguistic Quality

* **Final Perplexity (PPL):** 6.70
* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29

> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.

---

|
| | ## Architecture Details |
| |
|
| | * **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters) |
| | * **Student Model:** Pruned to (666,171,584 parameters) |
| | * **Layers Removed:** 4 Tranformer blocks removed (indices: [21, 20, 9, 22]) |
| | * **Parameter Reduction:** 11.46% |
| |
|

---

## Training Procedure

### Dataset
* **Source:** [Cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
* **Samples:** 40,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples)
* **Train/Val Split:** 80% / 20%

### Hyperparameters
* **Epochs:** 1
* **Batch Size:** 12 (effective: 48 with gradient accumulation)
* **Learning Rate:** 4e-05
* **Loss Function:** `α·CrossEntropy + β·Skew-KLD`
  * Task Loss Weight (α): 0.5
  * Logits Loss Weight (β): 0.5
  * Skew Interpolation Factor: 0.0
  * Temperature: 2.0
* **Optimizer:** AdamW
* **Gradient Clipping:** 1.0

### Hardware & Training Time
* **GPU:** NVIDIA A100-SXM4-80GB
* **Training Time:** 4011.1 s (66.85 minutes)
* **Avg Time per Epoch:** 4011.1 s
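The combined loss can be sketched as follows. Assumptions to flag: the skew-KL term here follows the common formulation KL(teacher ‖ skew·teacher + (1−skew)·student), which reduces to the ordinary distillation KL at skew = 0 (the setting above), and the KD term is rescaled by T²; OptiPFair's exact implementation may differ in details.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, beta=0.5, skew=0.0, temperature=2.0):
    """alpha * CrossEntropy(hard labels) + beta * Skew-KLD(soft labels).

    Skew-KLD = KL(p_teacher || skew * p_teacher + (1 - skew) * p_student);
    at skew=0 this is the ordinary distillation KL term.
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.reshape(-1, vocab),
                         labels.reshape(-1), ignore_index=-100)

    t = temperature
    p_t = F.softmax(teacher_logits / t, dim=-1)
    p_s = F.softmax(student_logits / t, dim=-1)
    mix = skew * p_t + (1.0 - skew) * p_s
    eps = 1e-9  # numerical floor for the logs
    kld = (p_t * ((p_t + eps).log() - (mix + eps).log())).sum(-1).mean()
    kld = kld * t ** 2  # conventional temperature rescaling

    return alpha * ce + beta * kld
```

Sanity check on the design: with identical student and teacher logits the KD term vanishes, so the loss collapses to α times the plain cross-entropy.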

---

## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Limitations & Intended Use

### Intended Use
This is an **educational model** created as part of the **Hands-on Lab in Chapter 6** of "Rearchitecting LLMs". It demonstrates:
- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through labels-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate

**Not intended for production use.** This model serves as a learning artifact and baseline for readers to improve upon.

### Limitations
- **Training Data:** General-purpose Cosmopedia corpus (not domain-specialized)
- **Knowledge Coverage:** Reduced compared to full-scale models due to structural pruning
- **Capabilities:** Best suited for simple completion tasks; complex reasoning may be degraded
- **Language:** English only

---

## Citation

If you use this model or the techniques described in your research or projects, please cite:

### Book
```bibtex
@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}
```

### Framework
```bibtex
@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}
```

---

## Acknowledgments

This model was created following the methodologies taught in **"Rearchitecting LLMs"** (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.

**Challenge for readers:** Can you improve the retention metrics beyond 92.1%? Try adjusting:
- Layer selection strategy (use cosine similarity analysis)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate

Share your results in the [book's discussion forum](https://hubs.la/Q040tvtp0)!