README.md · Scalai/scal-lite-60b-code-math at main

File size: 4,481 Bytes

29ed8e3
4222f5c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
29ed8e3
4222f5c
29ed8e3
4222f5c
 
29ed8e3
4222f5c
29ed8e3
4222f5c
29ed8e3
4222f5c
 
29ed8e3
4222f5c
 
 
 
29ed8e3
4222f5c
 
 
 
 
29ed8e3
4222f5c
29ed8e3
4222f5c
29ed8e3
4222f5c
 
 
 
29ed8e3
4222f5c
 
 
 
29ed8e3
4222f5c
29ed8e3
4222f5c
 
 
 
29ed8e3
4222f5c
29ed8e3
4222f5c
 
 
29ed8e3
4222f5c
29ed8e3
4222f5c
 
 
 
 
 
29ed8e3
4222f5c
 
 
29ed8e3
4222f5c
 
 
 
778af09
4222f5c
 
29ed8e3
4222f5c
29ed8e3
4222f5c
29ed8e3
4222f5c

---
language:
- en
- es
license: apache-2.0
tags:
- moe
- gpt-oss
- model-compression
- pruning
- mixture-of-experts
- optimization
- math
- code
- logic
model_name: Scal-lite-60b-code-math
base_model: openai/gpt-oss-120b
inference: true
pipeline_tag: text-generation
---

# Scal-lite-60b-code-math

## Model Summary
**Scal-lite-60b-code-math** is a highly efficient, structurally pruned version of the `gpt-oss-120b` Mixture of Experts (MoE) model. Through **Activation-Guided Structural Pruning**, the model was reduced from 128 to 64 experts, resulting in a 60-billion total parameter architecture (~5.1B active parameters per token). 

Unlike standard magnitude-based pruning, this model preserves critical specialized knowledge in low-frequency domains, such as **Spanish language proficiency**, **advanced LaTeX mathematics**, and **strict JSON/Python code generation**.

## Technical Methodology: "The Surgery"

### 1. Activation-Guided Sparsity
Conventional magnitude pruning (L2 norm) often fails in MoE models because specialized skills (like non-English languages or specific code syntaxes) are often mapped to experts with smaller weight magnitudes. 

To prevent "functional lobotomy," we implemented **Activation-Guided Pruning**:
- **Forward Hooks:** Monitored `mlp.router` activity during stress tests.
- **Utility Ranking:** Identified hyper-specialized experts (e.g., Expert #13) essential for Spanish logic and strict syntax.
- **Amputation:** Removed the 64 statistically least-used experts per layer based on real-world utility rather than static weight size.

### 2. Targeted Router Healing
Post-pruning, the original routing network suffers from "Router Trauma" or probability misalignment. To fix this, we applied a lightweight **Targeted Router Healing** process:
- **Frozen Experts:** Core knowledge weights remained untouched.
- **Trainable Router:** Fine-tuned only the gating network for 3,000 steps using the `MetaMathQA` dataset.
- **Result:** Successfully recalibrated the model's internal navigation to access its latent reasoning capabilities.

## Benchmarks & Evaluation

The optimization process not only halved the VRAM requirements but also restored benchmark performance to state-of-the-art levels for its size class.

| Benchmark | Scal-lite-60b (Pre-Healing) | Scal-lite-60b (Post-Healing) |
|-----------|-----------------------------|------------------------------|
| **GSM8K (Math)** | 17.59% | **72.48%** |
| **Hellaswag (Common Sense)** | 34.23% | **47.35%** |

### Real-World Validation: The Kaggle Challenge
Tested on a private set of 50 complex algorithmic programming problems:
- **Original Hypernova (Baseline):** 9/50 solved.
- **Scal-lite-60b-code-math:** **36/50 solved** (when equipped with Python execution tool-use).

## Hardware Requirements & Deployment

This model is designed to bridge the gap between massive MoEs and accessible hardware.
- **Precision (BF16):** ~120 GB VRAM (Recommended: 2x A100 80GB or 4x L40S).
- **Quantization (MXFP4):** ~60-65 GB VRAM (Compatible with NVIDIA Blackwell/Hopper architectures).
- **Efficiency:** Significant performance-per-watt gains over the original 120B version.

## Usage (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-username/Scal-lite-60b-code-math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example: Reasoning in Spanish
prompt = "Resuelve el siguiente problema: Si una red MoE tiene 128 expertos y podamos el 50%, ¿cuántos expertos quedan y cómo afecta esto a la VRAM?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Limitations
While Activation-Guided Pruning significantly preserves bilingual skills, some edge-case linguistic nuances may show degradation compared to the 120B original. Users are encouraged to apply context-specific system prompts for best results in non-English languages.
```
Citation & References
If you use this model or its pruning methodology, please cite:

Structural Pruning and Optimization in Mixture of Experts (MoE) Models: An Applied Analysis to GPT-OSS-120B.

OpenAI (2025). gpt-oss: Open-Weight Models for Advanced Reasoning.

ICLR Proceedings. "Mixture Compressor for Mixture-of-Experts LLMs Gains More."