---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics. 

Engineered by [SCALAI](https://scalai.es), this model was surgically pruned from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B-parameter footprint. Quantized to MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.

## 🧠 Model Details
* **Developer:** SCALAI
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B parameters (64 of 128 experts retained)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish
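
As a rough sanity check on the ~30GB VRAM figure above (illustrative arithmetic, not a measurement): MXFP4 packs 4-bit values into blocks of 32 that share one 8-bit scale, so each weight costs about 4.25 bits on average.

```python
# Back-of-the-envelope estimate of MXFP4 weight memory for a 60B model.
# MXFP4 block: 32 x 4-bit values + one shared 8-bit scale -> 4 + 8/32 bits/param.
PARAMS = 60e9
BITS_PER_PARAM = 4 + 8 / 32  # = 4.25

weight_bytes = PARAMS * BITS_PER_PARAM / 8
weight_gib = weight_bytes / 1024**3
print(f"~{weight_gib:.1f} GiB of weights")  # ≈ 29.7 GiB, consistent with ~30GB
```

Activations and the KV cache come on top of this, which is why the deployment settings below still matter on a 48GB card.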

## 📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Incredibly, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

### MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original `gpt-oss-120b` | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25%** |
| Math | 53.29% | 47.52% | 🟡 -5.77% |
| Law | 42.69% | 14.90% | 🔴 -27.79% (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

### GSM8K (Math Reasoning)
* **Pre-Healing (Pruned):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(healed on code data only; zero math data leakage)*

## 🎯 Intended Use & Limitations
* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally removed to halve the VRAM footprint.
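
To illustrate the JSON/schema-extraction use case, here is a minimal harness sketch (the helper names and prompt wording are hypothetical, not part of the model's API; the actual model call is left to the inference code below):

```python
import json

SYSTEM = (
    "You are a strict information-extraction engine. "
    "Respond with a single JSON object matching the requested schema; no prose."
)

def build_messages(text: str, schema: dict) -> list[dict]:
    """Assemble a chat request asking the model to fill `schema` from `text`."""
    user = f"Schema: {json.dumps(schema)}\n\nText: {text}"
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
    ]

def parse_reply(reply: str) -> dict:
    """Validate that the model's reply is well-formed JSON before use."""
    return json.loads(reply)
```

Validating replies with a strict JSON parse (and retrying on failure) is a cheap guardrail when the model is used for structured extraction.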

## 💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces extensive *Chain-of-Thought* reasoning, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True, # Disables CUDA graphs to save memory
    max_num_seqs=16,    # Limits pre-allocated KV cache 
    trust_remote_code=True
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500 
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."}
]

# Use the public tokenizer accessor to apply the chat template
tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```