---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---
# 🚀 ScaLite-60B-Coder by SCALAI
**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.
Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B active-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30 GB of VRAM, making it fully deployable on a single NVIDIA L40S (48 GB) GPU** with ample room for large KV caches in production environments.
## 🧠 Model Details
* **Developer:** SCALAI
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B active parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish
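The ~30 GB VRAM figure follows directly from the 4-bit MXFP4 weight format. A rough back-of-the-envelope sketch (ignoring the small per-block scale overhead of MXFP4 and activation/KV-cache memory):

```python
# Rough weight-memory estimate for 4-bit (MXFP4) quantization.
# MXFP4 stores one shared scale per block of elements, adding a small
# overhead on top of the 4 bits per parameter (treated as negligible here).
params = 60e9          # 60B parameters
bits_per_param = 4     # MXFP4 element width

weight_bytes = params * bits_per_param / 8
weight_gb = weight_bytes / 1e9

print(f"~{weight_gb:.0f} GB of weights")  # ~30 GB
```

The remaining ~18 GB on an L40S is what leaves room for the KV cache and activations mentioned above.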
## 📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Incredibly, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.
### MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original 120B | **ScaLite-60B-Coder** | Delta (pts) |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25** |
| Math | 53.29% | 47.52% | 🟡 -5.77 |
| Law | 42.69% | 14.90% | 🔴 -27.79 (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62 (Pruned) |
### GSM8K (Math Reasoning)
* **Pre-Healing (Pruned):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(Zero math data leakage)*
## 🎯 Intended Use & Limitations
* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally amputated to save 50% of the VRAM footprint.
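Since JSON/schema extraction is a headline use case, here is a minimal sketch of post-processing the model's output into a validated Python object. The `raw_response` string is hypothetical (in practice it would come from `llm.generate(...)`); the model may wrap JSON in prose or code fences, so the outermost braces are located before parsing:

```python
import json

# Hypothetical model response to a JSON-extraction prompt.
raw_response = 'Here is the result:\n```json\n{"name": "Ana", "age": 31}\n```'

def extract_json(text: str) -> dict:
    """Pull the outermost JSON object out of a possibly fenced/prosy reply."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])

record = extract_json(raw_response)
print(record["name"], record["age"])  # Ana 31
```

A `try/except ValueError` around the call is a reasonable guard in production, since even code-focused models occasionally emit malformed JSON.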
## 💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model features an extensive *Chain-of-Thought*, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.
```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,   # Disables CUDA graphs to save memory
    max_num_seqs=16,      # Limits pre-allocated KV cache
    trust_remote_code=True
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500
)

# Spanish prompt: "Find all integer pairs (x, y) satisfying
# x^3 + y^3 = (x+y)^2. Think step by step and justify each deduction."
messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."}
]

prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
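For production serving, the same memory-conscious settings map onto vLLM's OpenAI-compatible server. A sketch of the equivalent `vllm serve` invocation (flag values mirror the offline example; tune them for your hardware):

```shell
# Serve ScaLite-60B-Coder on a single 48GB GPU with the same
# memory-saving settings as the offline example.
vllm serve SCALAI/ScaLite-60B-Coder \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-num-seqs 16 \
  --trust-remote-code
```

This exposes a `/v1/chat/completions` endpoint, so standard OpenAI client libraries can talk to the model without extra glue code.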