---
language:
- en
tags:
- optipfair
- rearchitecting-llms
- depth-pruning
- model-optimization
- small-language-model
- Qwen-3.5
- educational
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B-Base
metrics:
- perplexity
- accuracy
datasets:
- HuggingFaceTB/cosmopedia
---

# Qwen3.5-0.65B-Base-Rearchitected

## Model Description

This model is a surgically optimized and distilled version of **Qwen3.5-0.8B-Base**, created with the techniques covered in **Chapter 6** of the book **"Rearchitecting LLMs"**.

* **Book:** [Rearchitecting LLMs](https://hubs.la/Q040tvtp0)
* **Framework:** [OptiPFair](https://github.com/peremartra/optipfair)
* **Technique:** Depth Pruning + Knowledge Distillation (Labels-Only with Skew KL Divergence)
* **Chapter:** Chapter 6 - Knowledge Recovery

[![linkedin-profile-banner-martra](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/sa4ivCbm8kk6C9NAPmb-x.jpeg)](https://hubs.la/Q040tvsK0)

---

## Performance & Retention Metrics

The goal of this optimization was twofold: to maximize parameter efficiency through structural pruning, and to perform a stylistic domain adaptation to the Cosmopedia dataset while retaining the Teacher's core reasoning capabilities.

### Retention Summary (vs Teacher Baseline)

| Metric | Value | Description |
|:---|:---|:---|
| **PPL Retention** | 109.62% | Linguistic quality preserved (Teacher PPL / Student PPL × 100) |
| **Capabilities Retention** | 89.21% | Reasoning power retained across benchmarks (Avg Student / Avg Teacher × 100) |
| **Overall Retention** | 92.11% | Combined health score (combining PPL and Capabilities retention) |

### Capability Benchmarks (LM Evaluation Harness)

**Recovery** = how much of the pruning degradation was recovered through distillation.
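For concreteness, Recovery can be computed per benchmark as the fraction of the pruning drop that distillation wins back. A minimal sketch, using the Lambada Openai row from the table below; the formula is the standard recovery ratio, not taken verbatim from the book:

```python
def recovery(teacher: float, pruned: float, distilled: float) -> float:
    """Percentage of the pruning degradation recovered by distillation."""
    return (distilled - pruned) / (teacher - pruned) * 100

# Lambada Openai row from the benchmark table: 50.9 -> 8.4 -> 39.9
print(round(recovery(50.9, 8.4, 39.9), 1))  # → 74.1
```

Lambada is the benchmark hit hardest by pruning, and also the one where distillation recovers the most ground.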
| Benchmark | Teacher | Pruned (No KD) | Distilled (After KD) |
|:---|:---:|:---:|:---:|
| **Arc Easy** | 67.5% | 56.3% | 60.7% |
| **Winogrande** | 59.4% | 55.5% | 55.9% |
| **Hellaswag** | 54.9% | 44.0% | 47.2% |
| **Lambada Openai** | 50.9% | 8.4% | 39.9% |
| **Piqa** | 71.5% | 63.6% | 67.7% |
| **Average** | 60.8% | 45.5% | 54.3% |

![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/FlaxH7EQBiFOBdk-fEpSN.png)

### Linguistic Quality

* **Final Perplexity (PPL):** 6.70
* **Teacher Baseline PPL:** 7.34
* **Pruned (No KD) PPL:** 24.29

> **Note on Perplexity:** The Student achieves a lower (better) PPL than the Teacher. This highlights the **Domain Adaptation** effect of the distillation process. The Student successfully specialized in the tone and structure of the Cosmopedia training corpus, refining its style while recovering structural knowledge.

![image](https://cdn-uploads.huggingface.co/production/uploads/640f7924f2d7c41a1e9eced1/2CDSSYlVJib7nHW84PyIY.png)

---

## Architecture Details

* **Teacher Model:** `Qwen3.5-0.8B-Base` (752,393,024 parameters)
* **Student Model:** 666,171,584 parameters after pruning
* **Layers Removed:** 4 Transformer blocks (indices: [21, 20, 9, 22])
* **Parameter Reduction:** 11.46%

---

## Training Procedure

### Dataset

* **Source:** [Cosmopedia-v2](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
* **Samples:** 40,000 (balanced across 4 subsets: stories, wikihow, openstax, web_samples)
* **Train/Val Split:** 80% / 20%

### Hyperparameters

* **Epochs:** 1
* **Batch Size:** 12 (effective: 48 with gradient accumulation)
* **Learning Rate:** 4e-05
* **Loss Function:** `α·CrossEntropy + β·Skew-KLD`
  * Task Loss Weight (α): 0.5
  * Logits Loss Weight (β): 0.5
  * Skew Interpolation Factor: 0.0
  * Temperature: 2.0
* **Optimizer:** AdamW
* **Gradient Clipping:** 1.0

### Hardware & Training Time

* **GPU:** NVIDIA A100-SXM4-80GB
* **Training Time:** 4011.1s (66.85 minutes)
* **Avg Time per Epoch:** 4011.1s

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "oopere/Qwen3.5-0.65B-Base-Rearchitected"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate text
prompt = "Paris is the capital of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,
    num_beams=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Limitations & Intended Use

### Intended Use

This is an **educational model** created as part of the **Hands-on Lab in Chapter 6** of "Rearchitecting LLMs". It demonstrates:

- Surgical depth pruning using data-driven layer importance analysis
- Knowledge recovery through labels-only distillation with Skew KL Divergence
- The complete optimization pipeline: Prune → Distill → Evaluate

**Not intended for production use.** This model serves as a learning artifact and baseline for readers to improve upon.
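The depth-pruning step in the pipeline above amounts to dropping whole decoder blocks from the model's layer stack and updating the config to match. A minimal sketch with a toy stand-in stack — on the real model you would operate on `model.model.layers`, the usual attribute in Qwen/LLaMA-style Transformers; the 24-block count here is illustrative, not this model's actual architecture:

```python
import torch.nn as nn

def drop_blocks(layers: nn.ModuleList, indices: list) -> nn.ModuleList:
    """Depth pruning: return a new stack without the blocks at `indices`."""
    removed = set(indices)
    return nn.ModuleList(block for i, block in enumerate(layers) if i not in removed)

# Toy stand-in for a decoder stack (illustrative block count)
stack = nn.ModuleList(nn.Linear(8, 8) for _ in range(24))
pruned = drop_blocks(stack, [21, 20, 9, 22])  # indices reported in this card
print(len(pruned))  # → 20

# On a real HF model you would then do (attribute names assume a Qwen/LLaMA-style model):
# model.model.layers = pruned
# model.config.num_hidden_layers = len(pruned)
```

Which indices to remove is the data-driven part: OptiPFair's layer-importance analysis selects the blocks whose removal least disturbs the model's outputs.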
### Limitations

- **Training Data:** General-purpose Cosmopedia corpus (not domain-specialized)
- **Knowledge Coverage:** Reduced compared to full-scale models due to structural pruning
- **Capabilities:** Best suited for simple completion tasks; complex reasoning may be degraded
- **Language:** English only

---

## Citation

If you use this model or the techniques described in your research or projects, please cite:

### Book

```bibtex
@book{martra2026rearchitecting,
  author    = {Pere Martra},
  title     = {Rearchitecting LLMs: Structural techniques for efficient models},
  publisher = {Manning Publications},
  year      = {2026},
  url       = {https://hubs.la/Q040tvtp0}
}
```

### Framework

```bibtex
@software{optipfair2024,
  author = {Pere Martra},
  title  = {OptiPFair: Structural Pruning and Bias Analysis for LLMs},
  year   = {2024},
  url    = {https://github.com/peremartra/optipfair}
}
```

---

## Acknowledgments

This model was created following the methodologies taught in **"Rearchitecting LLMs"** (Manning Publications, 2026). Special thanks to the Manning editorial team and the open-source community behind Hugging Face Transformers and PyTorch.

**Challenge for readers:** Can you improve the retention metrics beyond 92.1%? Try adjusting:

- Layer selection strategy (use cosine similarity analysis)
- Distillation dataset (domain-specific data)
- Loss function weights (α, β, temperature)
- Training epochs and learning rate

Share your results in the [book's discussion forum](https://hubs.la/Q040tvtp0)!
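As a starting point for the loss-weight experiments, here is a minimal sketch of the `α·CrossEntropy + β·Skew-KLD` objective. The skew term follows the common definition KL(p_teacher ‖ λ·p_teacher + (1−λ)·p_student), which reduces to the plain temperature-scaled distillation KLD at the λ = 0.0 used for this model; the book's exact implementation may differ:

```python
import torch
import torch.nn.functional as F

def skew_kld(student_logits, teacher_logits, temperature=2.0, lam=0.0):
    """KL(p_t || lam*p_t + (1-lam)*p_s); lam=0 is the standard distillation KLD."""
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    p_s = F.softmax(student_logits / temperature, dim=-1)
    mix = lam * p_t + (1.0 - lam) * p_s
    kld = (p_t * (p_t.clamp_min(1e-9).log() - mix.clamp_min(1e-9).log())).sum(-1).mean()
    return kld * temperature ** 2  # usual T^2 scaling so the term balances with CE

def distill_loss(student_logits, teacher_logits, labels,
                 alpha=0.5, beta=0.5, temperature=2.0, lam=0.0):
    """alpha * CrossEntropy(student, labels) + beta * Skew-KLD(student, teacher)."""
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    return alpha * ce + beta * skew_kld(student_logits, teacher_logits, temperature, lam)
```

The defaults match the hyperparameters listed above (α = β = 0.5, T = 2.0, λ = 0.0); raising λ skews the target distribution toward the teacher, which can stabilize training when the student is far from the teacher early on.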