---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B active-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.
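The ~30GB figure is consistent with simple arithmetic, assuming MXFP4's layout of 4-bit values with one shared 8-bit scale per 32-element block (per the OCP Microscaling format); a quick sanity check:

```python
# Rough VRAM estimate for the MXFP4 weights alone (excludes KV cache and
# activations). MXFP4 stores 4-bit values plus one shared 8-bit scale per
# 32-element block, i.e. ~4.25 bits per parameter.
params = 60e9                  # pruned parameter count
bits_per_param = 4 + 8 / 32    # 4.25 bits
weight_gb = params * bits_per_param / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights")  # ~31.9 GB, in line with the ~30GB claim
```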

## 🧠 Model Details

* **Developer:** SCALAI (Vicens Gaitan)
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B active parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish

## 🔬 The Innovation: Activation-Guided Sparsity & Router Healing

Standard magnitude-based pruning often lobotomizes minority capabilities such as secondary languages or strict syntax. To build ScaLite-60B, we pioneered a behavioral approach:

1. **Activation-Guided Sparsity:** We injected forward hooks into the original 120B model and ran a bilingual calibration dataset of complex math and code through it. We tracked actual expert utilization and permanently severed the 64 experts that represented the "encyclopedic long tail," preserving only the structural and logical specialists.
2. **Cross-Domain Router Healing:** Pruning 50% of the experts causes "Router Trauma" (probability misalignment). Instead of retraining on math (which risks data leakage), we froze the surviving experts and fine-tuned the router for 3,000 steps *exclusively* on Python code (`CodeAlpaca_20K`). This taught the router structural discipline, which generalized remarkably well to mathematical reasoning.
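The bookkeeping behind step 1 reduces to counting how often each expert appears in the router's top-k choices over the calibration set. The sketch below is illustrative only: in the real pipeline the per-token router logits come from forward hooks on each MoE layer's gate, whereas here they are passed in directly, and the toy sizes are hypothetical.

```python
from collections import Counter

def topk_indices(logits, k):
    """Indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]

def expert_utilization(router_logits, k=4):
    """Count how often each expert lands in a token's top-k routing set."""
    usage = Counter()
    for token_logits in router_logits:
        usage.update(topk_indices(token_logits, k))
    return usage

def experts_to_prune(usage, n_experts, n_keep):
    """Keep the n_keep most-used experts; sever the long tail."""
    keep = {e for e, _ in usage.most_common(n_keep)}
    return sorted(set(range(n_experts)) - keep)

# Toy calibration pass: 8 experts; tokens route only to experts 0-3.
calib = [[5, 4, 1, 0, -1, -2, -3, -4],
         [1, 0, 5, 4, -1, -2, -3, -4]] * 10
usage = expert_utilization(calib, k=2)
print(experts_to_prune(usage, n_experts=8, n_keep=4))  # [4, 5, 6, 7]
```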
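Step 2 is, mechanically, a selective freeze: during healing only the router/gate parameters remain trainable. A framework-agnostic sketch of the selection logic (the parameter naming convention below is hypothetical; match it to the actual checkpoint before use):

```python
def trainable_for_router_healing(param_names):
    """Select only router/gate parameters; every other parameter stays frozen."""
    return [n for n in param_names if ".router." in n or ".gate." in n]

# Hypothetical parameter names for illustration
names = [
    "model.layers.0.mlp.router.weight",
    "model.layers.0.mlp.experts.0.w1.weight",
    "model.layers.0.mlp.experts.1.w1.weight",
    "model.layers.0.self_attn.q_proj.weight",
]
print(trainable_for_router_healing(names))  # ['model.layers.0.mlp.router.weight']
# In PyTorch this would drive: p.requires_grad = (name in trainable_set)
```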
## 📊 Benchmark Performance

ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Remarkably, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

### MMLU-Pro (5-shot) - The Domain Specialist Effect

| Category | Original 120B | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25 pts** |
| Math | 53.29% | 47.52% | 🟡 -5.77 pts |
| Law | 42.69% | 14.90% | 🔴 -27.79 pts (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62 pts (Pruned) |
### GSM8K (Math Reasoning)

* **Pre-Healing (Traumatized):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(zero math data leakage)*

## 🎯 Intended Use & Limitations

* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally amputated to save 50% of the VRAM footprint.

## 💻 How to Use (vLLM on a single L40S)

ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces extensive *Chain-of-Thought* output, we recommend specific settings to prevent out-of-memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,   # Disables CUDA graphs to save memory
    max_num_seqs=16,      # Limits the pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    # Spanish prompt: "Find all integer pairs (x, y) satisfying the equation
    # x^3 + y^3 = (x+y)^2. Think step by step and justify every logical deduction."
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

# Build the prompt with the model's chat template
prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
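For serving rather than offline inference, the same memory settings map onto vLLM's OpenAI-compatible server. This is a sketch of a launch command; flag names follow recent vLLM releases, so verify them against `vllm serve --help` for your installed version:

```shell
# Same OOM-avoidance settings as above, expressed as server flags
vllm serve SCALAI/ScaLite-60B-Coder \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-num-seqs 16 \
  --trust-remote-code
```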