---
language:
- en
- es
license: apache-2.0
tags:
- moe
- gpt-oss
- model-compression
- pruning
- mixture-of-experts
- optimization
- math
- code
- logic
model_name: Scal-lite-60b-code-math
base_model: openai/gpt-oss-120b
inference: true
pipeline_tag: text-generation
---

# Scal-lite-60b-code-math

## Model Summary

**Scal-lite-60b-code-math** is a highly efficient, structurally pruned version of the `gpt-oss-120b` Mixture of Experts (MoE) model. Through **Activation-Guided Structural Pruning**, the model was reduced from 128 to 64 experts per layer, resulting in a 60-billion total-parameter architecture (~5.1B active parameters per token).

Unlike standard magnitude-based pruning, this model preserves critical specialized knowledge in low-frequency domains, such as **Spanish language proficiency**, **advanced LaTeX mathematics**, and **strict JSON/Python code generation**.
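
As a quick sanity check on the numbers above: halving the expert count roughly halves the total parameter budget while leaving the active-per-token count untouched, because the router still selects the same number of experts for every token. The shared-vs-expert split below is an illustrative assumption, not the exact gpt-oss breakdown:

```python
# Back-of-envelope arithmetic only; the shared/expert split is an assumption.
total_params  = 120e9   # approx. total parameters of the 120B base model
shared_params = 1.5e9   # assumed non-expert weights (attention, embeddings, norms)
expert_params = total_params - shared_params   # weight mass held by the 128 experts

# Removing 64 of 128 experts per layer drops half of the expert weight mass:
pruned_total = shared_params + expert_params * (64 / 128)
print(f"~{pruned_total / 1e9:.0f}B total parameters after pruning")  # ~61B

# The router still activates the same top-k experts per token, so the
# ~5.1B active parameters per token are unaffected by the pruning itself.
```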

## Technical Methodology: "The Surgery"

### 1. Activation-Guided Sparsity

Conventional magnitude pruning (L2 norm) often fails in MoE models because specialized skills (like non-English languages or specific code syntaxes) tend to be mapped to experts with smaller weight magnitudes.

To prevent "functional lobotomy," we implemented **Activation-Guided Pruning** (a minimal sketch follows the list):
- **Forward Hooks:** Monitored `mlp.router` activity during stress tests.
- **Utility Ranking:** Identified hyper-specialized experts (e.g., Expert #13) essential for Spanish logic and strict syntax.
- **Amputation:** Removed the 64 statistically least-used experts per layer based on real-world utility rather than static weight size.
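
A minimal sketch of this profiling step, assuming a Hugging Face-style checkpoint where each decoder layer exposes its gate as `model.model.layers[i].mlp.router` and the router emits one logit per expert; the module path, router output format, and top-k value are assumptions to verify against the actual gpt-oss implementation:

```python
import torch

# model / tokenizer loaded as in the Usage section below.
TOP_K = 4            # assumed number of experts activated per token
NUM_EXPERTS = 128    # experts per layer before pruning
KEEP = 64            # experts to keep per layer

expert_counts = {}   # layer index -> tensor of per-expert selection counts

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Assumption: the router produces one logit per expert for each token.
        logits = output[0] if isinstance(output, tuple) else output
        top_idx = logits.reshape(-1, NUM_EXPERTS).topk(TOP_K, dim=-1).indices
        counts = torch.bincount(top_idx.flatten(), minlength=NUM_EXPERTS).cpu()
        expert_counts[layer_idx] = expert_counts.get(layer_idx, 0) + counts
    return hook

handles = [
    layer.mlp.router.register_forward_hook(make_hook(i))   # assumed module path
    for i, layer in enumerate(model.model.layers)
]

# "Stress test" prompts covering the domains that must survive pruning.
calibration_prompts = [
    "Explica en español qué es una red Mixture of Experts.",        # Spanish reasoning
    r"Compute \int_0^1 x^2 \, dx and show the LaTeX derivation.",   # math / LaTeX
    'Write a Python function that returns {"status": "ok"} as strict JSON.',  # code
]

with torch.no_grad():
    for prompt in calibration_prompts:
        batch = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**batch)

for h in handles:
    h.remove()

# Rank experts by observed utility and keep the 64 most-used per layer.
keep = {i: c.topk(KEEP).indices.tolist() for i, c in expert_counts.items()}
```

Counting real selections rather than inspecting weight norms is what keeps rarely-but-critically used experts (e.g., the Spanish-heavy ones) off the removal list.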

### 2. Targeted Router Healing

Post-pruning, the original routing network suffers from "Router Trauma" (probability misalignment). To fix this, we applied a lightweight **Targeted Router Healing** process (sketched after the list):
- **Frozen Experts:** Core knowledge weights remained untouched.
- **Trainable Router:** Fine-tuned only the gating network for 3,000 steps using the `MetaMathQA` dataset.
- **Result:** Successfully recalibrated the model's internal routing so it can reach its latent reasoning capabilities.
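
A minimal sketch of the healing loop under the same naming assumption (any parameter whose name contains `router` belongs to the gating network). The actual run fine-tuned for 3,000 steps on `MetaMathQA`; `dataloader` and the learning rate below are placeholders:

```python
import torch

# Freeze every weight, then unfreeze only the gating (router) parameters.
for name, param in model.named_parameters():
    param.requires_grad = "router" in name   # assumed naming convention

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4  # lr is illustrative
)

model.train()
for step, batch in enumerate(dataloader):    # tokenized MetaMathQA batches
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step + 1 >= 3_000:                    # lightweight: router-only, 3k steps
        break
```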

## Benchmarks & Evaluation

The optimization process not only halved the VRAM requirements but also restored benchmark performance to competitive levels for its size class, as the pre- vs. post-healing comparison shows.

| Benchmark | Scal-lite-60b (Pre-Healing) | Scal-lite-60b (Post-Healing) |
|-----------|-----------------------------|------------------------------|
| **GSM8K (Math)** | 17.59% | **72.48%** |
| **HellaSwag (Common Sense)** | 34.23% | **47.35%** |

### Real-World Validation: The Kaggle Challenge

Tested on a private set of 50 complex algorithmic programming problems:
- **Original Hypernova (Baseline):** 9/50 solved.
- **Scal-lite-60b-code-math:** **36/50 solved** (when equipped with Python execution tool-use).

## Hardware Requirements & Deployment

This model is designed to bridge the gap between massive MoEs and accessible hardware.
- **Precision (BF16):** ~120 GB VRAM (recommended: 2x A100 80GB or 4x L40S); see the estimate below the list.
- **Quantization (MXFP4):** ~60-65 GB VRAM (compatible with NVIDIA Blackwell/Hopper architectures).
- **Efficiency:** Significant performance-per-watt gains over the original 120B version.
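
A quick estimate of where the BF16 figure comes from (weights only; KV cache, activations, and framework overhead come on top):

```python
params = 60e9                # total parameters after pruning
bf16_bytes_per_param = 2     # bfloat16 stores each weight in 2 bytes
print(f"BF16 weights alone: ~{params * bf16_bytes_per_param / 1e9:.0f} GB")  # ~120 GB
```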

## Usage (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-username/Scal-lite-60b-code-math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example: reasoning in Spanish.
# ("Solve the following problem: if an MoE network has 128 experts and we prune
#  50%, how many experts remain and how does this affect VRAM?")
prompt = "Resuelve el siguiente problema: Si una red MoE tiene 128 expertos y podamos el 50%, ¿cuántos expertos quedan y cómo afecta esto a la VRAM?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Limitations

While Activation-Guided Pruning significantly preserves bilingual skills, some edge-case linguistic nuances may show degradation compared to the 120B original. Users are encouraged to apply context-specific system prompts for best results in non-English languages.
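
A minimal example of the system-prompt recommendation above, reusing `model` and `tokenizer` from the Usage section and assuming the checkpoint ships a standard chat template:

```python
messages = [
    # System prompt pinning the working language (Spanish) and the style.
    {"role": "system", "content": "Responde siempre en español, razonando paso a paso."},
    # "Explain the difference between magnitude pruning and activation-guided pruning."
    {"role": "user", "content": "Explica la diferencia entre la poda por magnitud y la poda guiada por activaciones."},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```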

## Citation & References

If you use this model or its pruning methodology, please cite:

- *Structural Pruning and Optimization in Mixture of Experts (MoE) Models: An Applied Analysis to GPT-OSS-120B.*
- OpenAI (2025). *gpt-oss: Open-Weight Models for Advanced Reasoning.*
- *Mixture Compressor for Mixture-of-Experts LLMs Gains More.* ICLR Proceedings.