🚀 ScaLite-60B-Coder by SCALAI
ScaLite-60B-Coder is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.
Engineered by SCALAI, this model was surgically pruned from OpenAI's 117B-parameter MoE gpt-oss-120b down to a 60B-parameter footprint. Quantized in MXFP4, ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU with ample headroom for large KV caches in production environments.
🧠 Model Details
- Developer: SCALAI
- Model Type: Pruned Mixture-of-Experts (MoE) Causal Language Model
- Base Model: openai/gpt-oss-120b (128 experts)
- Pruned Architecture: 60B parameters (64 experts)
- Quantization: MXFP4
- Context Length: 8,192 tokens (optimized for complex Chain-of-Thought)
- Languages: English, Spanish
📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Notably, removing the interference of general-knowledge experts allowed this pruned 60B model to outperform its 120B parent in Computer Science.
MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original 120B | ScaLite-60B-Coder | Delta (pp) |
|---|---|---|---|
| Computer Science | 39.02% | 49.27% | 🟢 +10.25 |
| Math | 53.29% | 47.52% | 🟡 -5.77 |
| Law | 42.69% | 14.90% | 🔴 -27.79 (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62 (Pruned) |
GSM8K (Math Reasoning)
- Pre-Healing (Pruned): 17.59%
- Post-Healing (Cross-Domain Code): 61.03% (Zero math data leakage)
🎯 Intended Use & Limitations
- Use Cases: Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
- Limitations: Do not use this model for encyclopedic trivia, historical facts, or legal advice. The experts responsible for memorizing general internet knowledge were intentionally amputated to save 50% of the VRAM footprint.
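For the JSON/schema-extraction use case, model replies often arrive wrapped in a markdown code fence with surrounding prose. A minimal post-processing helper (our own illustrative sketch, not part of the model or vLLM) might look like:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Extract the first JSON object from a model reply,
    tolerating an optional ```json ... ``` markdown fence."""
    # Prefer the contents of a fenced block if one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fence.group(1) if fence else text
    # Fall back to the outermost braces within the candidate text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])

reply = 'Sure, here is the schema:\n```json\n{"name": "ScaLite", "experts": 64}\n```'
print(parse_json_reply(reply))  # → {'name': 'ScaLite', 'experts': 64}
```

Pairing a helper like this with a low sampling temperature keeps extraction output parseable even when the model adds conversational framing.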
💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the vLLM engine. Because the model features an extensive Chain-of-Thought, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.
```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,          # Shorter context window prevents OOM on 48GB cards
    gpu_memory_utilization=0.90,
    enforce_eager=True,          # Disables CUDA graphs to save memory
    max_num_seqs=16,             # Limits pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    # Spanish prompt: "Find all integer pairs (x, y) satisfying x^3 + y^3 = (x+y)^2.
    # Think step by step and justify every logical deduction."
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
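The `max_model_len` and `max_num_seqs` settings bound how much KV cache vLLM pre-allocates. As a back-of-envelope check (the layer count, KV-head count, and head dimension below are hypothetical placeholders, since the pruned architecture's exact dimensions are not listed on this card), the worst-case KV footprint can be sketched as:

```python
# Back-of-envelope KV-cache sizing for the serving settings above.
# num_layers, num_kv_heads, and head_dim are ASSUMED values for illustration.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, num_seqs, dtype_bytes=2):
    """Bytes needed to hold K and V for every layer of every sequence."""
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len * num_seqs

gib = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=64,
                     seq_len=4096, num_seqs=16) / 1024**3
print(f"{gib:.2f} GiB")  # → "4.50 GiB"
```

With grouped-query-attention-like dimensions in this range, the full-batch KV cache stays in the single-digit GiB range, which is why a ~30GB model still fits comfortably under `gpu_memory_utilization=0.90` on a 48GB card.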