---
language:
- en
- es
license: apache-2.0
tags:
- moe
- gpt-oss
- model-compression
- pruning
- mixture-of-experts
- optimization
- math
- code
- logic
model_name: Scal-lite-60b-code-math
base_model: openai/gpt-oss-120b
inference: true
pipeline_tag: text-generation
---

# Scal-lite-60b-code-math

## Model Summary

**Scal-lite-60b-code-math** is a highly efficient, structurally pruned version of the `gpt-oss-120b` Mixture of Experts (MoE) model. Through **Activation-Guided Structural Pruning**, the model was reduced from 128 to 64 experts per layer, resulting in a 60-billion total-parameter architecture (~5.1B active parameters per token).

Unlike standard magnitude-based pruning, this model preserves critical specialized knowledge in low-frequency domains, such as **Spanish language proficiency**, **advanced LaTeX mathematics**, and **strict JSON/Python code generation**.
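
As a quick sanity check on the numbers above: halving the expert count roughly halves the total parameter budget while leaving the active-per-token count untouched, because the router still selects the same number of experts for every token. The shared-vs-expert split below is an illustrative assumption, not the exact gpt-oss breakdown:

```python
# Back-of-envelope arithmetic only; the shared/expert split is an assumption.
total_params  = 120e9   # approx. total parameters of the 120B base model
shared_params = 1.5e9   # assumed non-expert weights (attention, embeddings, norms)
expert_params = total_params - shared_params   # weight mass held by the 128 experts

# Removing 64 of 128 experts per layer drops half of the expert weight mass:
pruned_total = shared_params + expert_params * (64 / 128)
print(f"~{pruned_total / 1e9:.0f}B total parameters after pruning")  # ~61B

# The router still activates the same top-k experts per token, so the
# ~5.1B active parameters per token are unaffected by the pruning itself.
```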

## Technical Methodology: "The Surgery"

### 1. Activation-Guided Sparsity

Conventional magnitude pruning (L2 norm) often fails in MoE models because specialized skills (like non-English languages or specific code syntaxes) tend to be mapped to experts with smaller weight magnitudes.

To prevent "functional lobotomy," we implemented **Activation-Guided Pruning** (a minimal sketch follows the list):
- **Forward Hooks:** Monitored `mlp.router` activity during stress tests.
- **Utility Ranking:** Identified hyper-specialized experts (e.g., Expert #13) essential for Spanish logic and strict syntax.
- **Amputation:** Removed the 64 statistically least-used experts per layer based on real-world utility rather than static weight size.
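
A minimal sketch of this profiling step, assuming a Hugging Face-style checkpoint where each decoder layer exposes its gate as `model.model.layers[i].mlp.router` and the router emits one logit per expert; the module path, router output format, and top-k value are assumptions to verify against the actual gpt-oss implementation:

```python
import torch

# model / tokenizer loaded as in the Usage section below.
TOP_K = 4            # assumed number of experts activated per token
NUM_EXPERTS = 128    # experts per layer before pruning
KEEP = 64            # experts to keep per layer

expert_counts = {}   # layer index -> tensor of per-expert selection counts

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Assumption: the router produces one logit per expert for each token.
        logits = output[0] if isinstance(output, tuple) else output
        top_idx = logits.reshape(-1, NUM_EXPERTS).topk(TOP_K, dim=-1).indices
        counts = torch.bincount(top_idx.flatten(), minlength=NUM_EXPERTS).cpu()
        expert_counts[layer_idx] = expert_counts.get(layer_idx, 0) + counts
    return hook

handles = [
    layer.mlp.router.register_forward_hook(make_hook(i))   # assumed module path
    for i, layer in enumerate(model.model.layers)
]

# "Stress test" prompts covering the domains that must survive pruning.
calibration_prompts = [
    "Explica en español qué es una red Mixture of Experts.",        # Spanish reasoning
    r"Compute \int_0^1 x^2 \, dx and show the LaTeX derivation.",   # math / LaTeX
    'Write a Python function that returns {"status": "ok"} as strict JSON.',  # code
]

with torch.no_grad():
    for prompt in calibration_prompts:
        batch = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**batch)

for h in handles:
    h.remove()

# Rank experts by observed utility and keep the 64 most-used per layer.
keep = {i: c.topk(KEEP).indices.tolist() for i, c in expert_counts.items()}
```

Counting real selections rather than inspecting weight norms is what keeps rarely-but-critically used experts (e.g., the Spanish-heavy ones) off the removal list.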

### 2. Targeted Router Healing

Post-pruning, the original routing network suffers from "Router Trauma" (probability misalignment). To fix this, we applied a lightweight **Targeted Router Healing** process (sketched after the list):
- **Frozen Experts:** Core knowledge weights remained untouched.
- **Trainable Router:** Fine-tuned only the gating network for 3,000 steps using the `MetaMathQA` dataset.
- **Result:** Successfully recalibrated the model's internal routing so it can reach its latent reasoning capabilities.
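
A minimal sketch of the healing loop under the same naming assumption (any parameter whose name contains `router` belongs to the gating network). The actual run fine-tuned for 3,000 steps on `MetaMathQA`; `dataloader` and the learning rate below are placeholders:

```python
import torch

# Freeze every weight, then unfreeze only the gating (router) parameters.
for name, param in model.named_parameters():
    param.requires_grad = "router" in name   # assumed naming convention

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4  # lr is illustrative
)

model.train()
for step, batch in enumerate(dataloader):    # tokenized MetaMathQA batches
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step + 1 >= 3_000:                    # lightweight: router-only, 3k steps
        break
```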

## Benchmarks & Evaluation

The optimization process not only halved the VRAM requirements but also restored benchmark performance to competitive levels for its size class, as the pre- vs. post-healing comparison shows.

| Benchmark | Scal-lite-60b (Pre-Healing) | Scal-lite-60b (Post-Healing) |
|-----------|-----------------------------|------------------------------|
| **GSM8K (Math)** | 17.59% | **72.48%** |
| **HellaSwag (Common Sense)** | 34.23% | **47.35%** |

### Real-World Validation: The Kaggle Challenge

Tested on a private set of 50 complex algorithmic programming problems:
- **Original Hypernova (Baseline):** 9/50 solved.
- **Scal-lite-60b-code-math:** **36/50 solved** (when equipped with Python execution tool-use).

## Hardware Requirements & Deployment

This model is designed to bridge the gap between massive MoEs and accessible hardware.
- **Precision (BF16):** ~120 GB VRAM (recommended: 2x A100 80GB or 4x L40S); see the estimate below the list.
- **Quantization (MXFP4):** ~60-65 GB VRAM (compatible with NVIDIA Blackwell/Hopper architectures).
- **Efficiency:** Significant performance-per-watt gains over the original 120B version.
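
A quick estimate of where the BF16 figure comes from (weights only; KV cache, activations, and framework overhead come on top):

```python
params = 60e9                # total parameters after pruning
bf16_bytes_per_param = 2     # bfloat16 stores each weight in 2 bytes
print(f"BF16 weights alone: ~{params * bf16_bytes_per_param / 1e9:.0f} GB")  # ~120 GB
```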

## Usage (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your-username/Scal-lite-60b-code-math"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example: reasoning in Spanish.
# ("Solve the following problem: if an MoE network has 128 experts and we prune
#  50%, how many experts remain and how does this affect VRAM?")
prompt = "Resuelve el siguiente problema: Si una red MoE tiene 128 expertos y podamos el 50%, ¿cuántos expertos quedan y cómo afecta esto a la VRAM?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Limitations

While Activation-Guided Pruning significantly preserves bilingual skills, some edge-case linguistic nuances may show degradation compared to the 120B original. Users are encouraged to apply context-specific system prompts for best results in non-English languages.
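
A minimal example of the system-prompt recommendation above, reusing `model` and `tokenizer` from the Usage section and assuming the checkpoint ships a standard chat template:

```python
messages = [
    # System prompt pinning the working language (Spanish) and the style.
    {"role": "system", "content": "Responde siempre en español, razonando paso a paso."},
    # "Explain the difference between magnitude pruning and activation-guided pruning."
    {"role": "user", "content": "Explica la diferencia entre la poda por magnitud y la poda guiada por activaciones."},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```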

## Citation & References

If you use this model or its pruning methodology, please cite:

- *Structural Pruning and Optimization in Mixture of Experts (MoE) Models: An Applied Analysis to GPT-OSS-120B.*
- OpenAI (2025). *gpt-oss: Open-Weight Models for Advanced Reasoning.*
- *Mixture Compressor for Mixture-of-Experts LLMs Gains More.* ICLR Proceedings.