Vicens committed · verified
Commit e951e23 · 1 Parent(s): 04d83ca

Create README.md

Files changed (1): README.md (+92, −0)
---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.

## 🧠 Model Details
* **Developer:** SCALAI (Vicens Gaitan)
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B total parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish

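For intuition on the MXFP4 format above: in the OCP Microscaling (MX) spec, a block of up to 32 values shares one power-of-two scale, and each value is stored as a 4-bit E2M1 float. The sketch below is purely illustrative (it is not the kernel this model ships with):

```python
import math

# FP4 (E2M1) representable magnitudes per the OCP Microscaling (MX) spec
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_roundtrip(block):
    """Quantize a block (<= 32 floats) to MXFP4 and dequantize it again."""
    assert len(block) <= 32, "MXFP4 shares one scale per 32-element block"
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Shared power-of-two scale (E8M0 in the spec), chosen so the largest
    # magnitude lands at or below 6.0, the top of the FP4 grid
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for x in block:
        q = min(FP4_GRID, key=lambda g: abs(g - abs(x) / scale))  # snap to grid
        out.append(math.copysign(q * scale, x))
    return out

# Values already on the (scaled) grid survive the round trip exactly
print(mxfp4_roundtrip([6.0, -3.0, 1.5, 0.0]))  # [6.0, -3.0, 1.5, 0.0]
```

Weights off the grid are rounded to the nearest representable point, which is where the compression loss comes from.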

## 🔬 The Innovation: Activation-Guided Sparsity & Router Healing
Standard magnitude-based pruning often lobotomizes minority capabilities such as secondary languages or strict syntax. To build ScaLite-60B, we took a behavioral approach instead:

1. **Activation-Guided Sparsity:** We injected forward hooks into the original 120B model and ran a bilingual calibration dataset of complex math and code through it. We tracked actual expert utilization and permanently severed the 64 experts representing the "encyclopedic long tail," preserving only the structural and logical specialists.
2. **Cross-Domain Router Healing:** Pruning 50% of the experts causes "Router Trauma" (probability misalignment). Instead of retraining on math (which risks data leakage), we froze the surviving experts and fine-tuned the router for 3,000 steps *exclusively* on Python code (`CodeAlpaca_20K`). This taught the router structural discipline, which generalized remarkably well to mathematical reasoning.

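The utilization-tracking step can be sketched in miniature. In the real pipeline the logits would be captured by forward hooks on each MoE router; here they are synthetic, and all names are hypothetical:

```python
import random
from collections import Counter

def track_expert_usage(router_logits, top_k=4):
    """Count how often each expert lands in the top-k over a calibration pass.

    `router_logits` is one logit vector (one entry per expert) per token.
    """
    counts = Counter()
    for logits in router_logits:
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        counts.update(ranked[:top_k])
    return counts

def select_experts_to_prune(counts, n_experts, n_prune):
    """Sever the least-activated experts (the 'encyclopedic long tail')."""
    usage = sorted((counts.get(e, 0), e) for e in range(n_experts))
    return sorted(e for _, e in usage[:n_prune])

# Toy calibration pass: 8 experts; experts 6 and 7 never fire on this data
random.seed(0)
logits = [[random.random() for _ in range(6)] + [-10.0, -10.0] for _ in range(100)]
counts = track_expert_usage(logits, top_k=2)
pruned = select_experts_to_prune(counts, n_experts=8, n_prune=2)
print(pruned)  # [6, 7]
```

The dormant experts are exactly the ones selected for removal, which is the core of the activation-guided criterion.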

## 📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

### MMLU-Pro (5-shot): The Domain-Specialist Effect
| Category | Original 120B | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25 pts** |
| Math | 53.29% | 47.52% | 🟡 -5.77 pts |
| Law | 42.69% | 14.90% | 🔴 -27.79 pts (pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62 pts (pruned) |


### GSM8K (Math Reasoning)
* **Pre-Healing (Traumatized):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(zero math data leakage)*

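A toy illustration of the healing recipe (frozen experts, router-only updates): the "experts" here are fixed functions and the "router" a tiny linear scorer trained with a perceptron-style rule. None of this is the actual training code; it only shows which parameters move and which stay frozen.

```python
def route(x, W):
    """Pick the expert with the highest router score for token feature x."""
    feats = [x, 1.0]
    return max(range(len(W)), key=lambda e: sum(w * f for w, f in zip(W[e], feats)))

def heal_router(samples, experts, epochs=20, lr=0.1):
    """Update only the router weights W; the expert parameters never change."""
    W = [[0.0, 0.0] for _ in experts]  # per-expert weights over [x, bias]
    for _ in range(epochs):
        for x, y in samples:
            best = min(range(len(experts)), key=lambda e: abs(experts[e](x) - y))
            pred = route(x, W)
            if pred != best:  # misroute: nudge scores toward the better expert
                for j, f in enumerate([x, 1.0]):
                    W[best][j] += lr * f
                    W[pred][j] -= lr * f
    return W

# Two frozen "experts": one halves its input, one doubles it
experts = [lambda x: 0.5 * x, lambda x: 2.0 * x]
# Calibration pairs: positive tokens belong to the doubler, negative to the halver
samples = [(1.0, 2.0), (2.0, 4.0), (-1.0, -0.5), (-2.0, -1.0)]
W = heal_router(samples, experts)
print(route(1.5, W), route(-1.5, W))  # 1 0
```

After "healing," the router sends each token to the expert that actually handles it best, even though the experts themselves were never touched.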

## 🎯 Intended Use & Limitations
* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally amputated to cut the VRAM footprint by 50%.


## 💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces extensive *Chain-of-Thought*, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

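These settings can be sanity-checked with back-of-the-envelope KV-cache arithmetic. The geometry below (36 layers, 8 KV heads, head_dim 64) is our assumption for the gpt-oss family; verify it against the model's `config.json`. It also ignores gpt-oss's sliding-window layers, so it is an upper bound:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, dtype_bytes, seq_len, n_seqs):
    """Bytes needed to hold K and V for every token of every sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # K + V
    return per_token * seq_len * n_seqs

# Assumed gpt-oss-style geometry; bf16 cache, README's vLLM settings
budget = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=64,
                        dtype_bytes=2, seq_len=4096, n_seqs=16)
print(f"{budget / 2**30:.1f} GiB")  # 4.5 GiB
```

With ~30GB of weights, a few GiB of worst-case KV cache still leaves headroom under the 48GB of an L40S, which is why `max_model_len=4096` and `max_num_seqs=16` are safe choices below.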
```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,   # Disables CUDA graphs to save memory
    max_num_seqs=16,      # Limits the pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    # "Find all pairs of integers (x, y) satisfying x^3 + y^3 = (x+y)^2.
    #  Think step by step and justify every logical deduction."
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

# Build the prompt with the model's chat template
prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```