🚀 ScaLite-60B-Coder by SCALAI
ScaLite-60B-Coder is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.
Engineered by SCALAI, this model was surgically pruned from OpenAI's 117B-parameter MoE gpt-oss-120b down to a 60B-parameter footprint. Quantized in MXFP4, ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU with ample headroom for large KV caches in production environments.
🧠 Model Details
- Developer: SCALAI
- Model Type: Pruned Mixture-of-Experts (MoE) Causal Language Model
- Base Model: openai/gpt-oss-120b (128 experts)
- Pruned Architecture: 60B parameters (64 experts)
- Quantization: MXFP4
- Context Length: 8,192 tokens (optimized for complex Chain-of-Thought)
- Languages: English, Spanish
📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Notably, removing the interference of general-knowledge experts allowed this pruned 60B model to outperform its 120B parent in Computer Science.
MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original 120B | ScaLite-60B-Coder | Delta (pp) |
|---|---|---|---|
| Computer Science | 39.02% | 49.27% | 🟢 +10.25 |
| Math | 53.29% | 47.52% | 🟡 -5.77 |
| Law | 42.69% | 14.90% | 🔴 -27.79 (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62 (Pruned) |
GSM8K (Math Reasoning)
- Pre-Healing (Pruned): 17.59%
- Post-Healing (Cross-Domain Code): 61.03% (Zero math data leakage)
🎯 Intended Use & Limitations
- Use Cases: Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
- Limitations: Do not use this model for encyclopedic trivia, historical facts, or legal advice. The experts responsible for memorizing general internet knowledge were intentionally amputated to save 50% of the VRAM footprint.
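For the JSON/schema-extraction use case, model replies often arrive wrapped in a markdown code fence with surrounding prose. A minimal post-processing helper (our own illustrative sketch, not part of the model or vLLM) might look like:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Extract the first JSON object from a model reply,
    tolerating an optional ```json ... ``` markdown fence."""
    # Prefer the contents of a fenced block if one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fence.group(1) if fence else text
    # Fall back to the outermost braces within the candidate text.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])

reply = 'Sure, here is the schema:\n```json\n{"name": "ScaLite", "experts": 64}\n```'
print(parse_json_reply(reply))  # → {'name': 'ScaLite', 'experts': 64}
```

Pairing a helper like this with a low sampling temperature keeps extraction output parseable even when the model adds conversational framing.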
💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the vLLM engine. Because the model features an extensive Chain-of-Thought, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.
```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,          # Shorter context window prevents OOM on 48GB cards
    gpu_memory_utilization=0.90,
    enforce_eager=True,          # Disables CUDA graphs to save memory
    max_num_seqs=16,             # Limits pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    # Spanish prompt: "Find all integer pairs (x, y) satisfying x^3 + y^3 = (x+y)^2.
    # Think step by step and justify every logical deduction."
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
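The `max_model_len` and `max_num_seqs` settings bound how much KV cache vLLM pre-allocates. As a back-of-envelope check (the layer count, KV-head count, and head dimension below are hypothetical placeholders, since the pruned architecture's exact dimensions are not listed on this card), the worst-case KV footprint can be sketched as:

```python
# Back-of-envelope KV-cache sizing for the serving settings above.
# num_layers, num_kv_heads, and head_dim are ASSUMED values for illustration.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, num_seqs, dtype_bytes=2):
    """Bytes needed to hold K and V for every layer of every sequence."""
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len * num_seqs

gib = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=64,
                     seq_len=4096, num_seqs=16) / 1024**3
print(f"{gib:.2f} GiB")  # → "4.50 GiB"
```

With grouped-query-attention-like dimensions in this range, the full-batch KV cache stays in the single-digit GiB range, which is why a ~30GB model still fits comfortably under `gpu_memory_utilization=0.90` on a 48GB card.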