🚀 ScaLite-60B-Coder by SCALAI

ScaLite-60B-Coder is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by SCALAI, this model was surgically pruned and distilled from OpenAI's 117B-parameter MoE (gpt-oss-120b) down to a ~60B-parameter footprint. Quantized in MXFP4, ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU with ample headroom for large KV caches in production environments.
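The ~30GB figure follows directly from the quantization arithmetic: MXFP4 stores weights at 4 bits per parameter plus a small per-block scale. A rough sketch (the one-byte-per-32-element scale overhead is an assumption about the MX block layout, not a measured value):

```python
# Back-of-the-envelope VRAM estimate for ~60B parameters in MXFP4.
params = 60e9
bits_per_param = 4
weight_bytes = params * bits_per_param / 8   # 4-bit weight payload
scale_bytes = params / 32                    # assumed: one scale byte per 32-param block
total_gb = (weight_bytes + scale_bytes) / 1e9

print(f"Estimated weight footprint: {total_gb:.1f} GB")  # roughly 32 GB
```

Activations and the KV cache come on top of this, which is why the L40S's remaining ~16GB matters for serving.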

🧠 Model Details

  • Developer: SCALAI
  • Model Type: Pruned Mixture-of-Experts (MoE) Causal Language Model
  • Base Model: openai/gpt-oss-120b (128 experts)
  • Pruned Architecture: ~60B total parameters (64 of 128 experts retained)
  • Quantization: MXFP4
  • Context Length: 8,192 tokens (optimized for complex Chain-of-Thought)
  • Languages: English, Spanish
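The 117B→60B reduction is consistent with halving the expert count, since in an MoE nearly all parameters live in the expert FFNs. A back-of-the-envelope sketch (the shared/non-expert parameter figure is an assumed illustrative number, not taken from the model config):

```python
# Rough parameter accounting for expert pruning (illustrative numbers).
total_parent = 117e9      # gpt-oss-120b total parameters
shared = 3e9              # assumed: attention, embeddings, router, norms
expert = total_parent - shared

# Keeping 64 of 128 experts halves only the expert weights.
pruned_total = shared + expert / 2
print(f"Pruned model: ~{pruned_total / 1e9:.0f}B parameters")
```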

📊 Benchmark Performance

ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Incredibly, removing the interference of general-knowledge experts allowed this pruned 60B model to outperform its 120B parent in Computer Science.

MMLU-Pro (5-shot) - The Domain Specialist Effect

| Category | Original 120B | ScaLite-60B-Coder | Delta |
|---|---|---|---|
| Computer Science | 39.02% | 49.27% | 🟢 +10.25% |
| Math | 53.29% | 47.52% | 🟡 -5.77% |
| Law | 42.69% | 14.90% | 🔴 -27.79% (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

GSM8K (Math Reasoning)

  • Pre-Healing (Pruned): 17.59%
  • Post-Healing (Cross-Domain Code): 61.03% (Zero math data leakage)

🎯 Intended Use & Limitations

  • Use Cases: Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
  • Limitations: Do not use this model for encyclopedic trivia, historical facts, or legal advice. The experts responsible for memorizing general internet knowledge were intentionally pruned away to cut the VRAM footprint roughly in half.
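For the JSON/schema extraction use case, a minimal sketch of one way to prompt and parse robustly. The schema, helper names, and the mock reply below are all illustrative assumptions, not output from ScaLite-60B-Coder:

```python
import json

# Illustrative schema for a structured-extraction prompt.
schema = {"name": "string", "language": "string", "stars": "integer"}

def build_messages(text: str) -> list[dict]:
    """Build a chat prompt that asks for a single schema-conforming JSON object."""
    system = (
        "You are a strict JSON extractor. Respond with a single JSON object "
        f"matching this schema, with no prose: {json.dumps(schema)}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": text}]

def parse_reply(reply: str) -> dict:
    """Models with long CoT sometimes wrap JSON in extra text;
    grab the outermost braces before parsing."""
    start, end = reply.find("{"), reply.rfind("}")
    return json.loads(reply[start:end + 1])

# Mock completion standing in for a real model response.
mock_reply = 'Sure: {"name": "vllm", "language": "Python", "stars": 30000}'
print(parse_reply(mock_reply)["name"])  # vllm
```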

💻 How to Use (vLLM on a single L40S)

ScaLite-60B-Coder is optimized for the vLLM engine. Because the model features an extensive Chain-of-Thought, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True, # Disables CUDA graphs to save memory
    max_num_seqs=16,    # Limits pre-allocated KV cache
    trust_remote_code=True
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    # Spanish: "Find all pairs of integers (x, y) satisfying the equation
    # x^3 + y^3 = (x+y)^2. Think step by step and justify every logical deduction."
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."}
]

prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```