---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.

## 🧠 Model Details

* **Developer:** SCALAI
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish

## 📊 Benchmark Performance

ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Notably, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.
### MMLU-Pro (5-shot) - The Domain Specialist Effect

| Category | Original 120B | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25%** |
| Math | 53.29% | 47.52% | 🟡 -5.77% |
| Law | 42.69% | 14.90% | 🔴 -27.79% (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

### GSM8K (Math Reasoning)

* **Pre-Healing (Pruned):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(Zero math data leakage)*

## 🎯 Intended Use & Limitations

* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally amputated to save 50% of the VRAM footprint.

## 💻 How to Use (vLLM on a single L40S)

ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces extensive *Chain-of-Thought* output, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,  # Disables CUDA graphs to save memory
    max_num_seqs=16,     # Limits pre-allocated KV cache
    trust_remote_code=True
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500
)

messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    {"role": "user", "content": (
        # English: "Find all pairs of integers (x, y) that satisfy the equation
        # x^3 + y^3 = (x+y)^2. Think step by step and justify each logical deduction."
        "Encuentra todos los pares de números enteros (x, y) que satisfacen "
        "la siguiente ecuación: x^3 + y^3 = (x+y)^2. "
        "Piensa paso a paso y justifica cada deducción lógica."
    )},
]

# Render the chat template and generate
prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
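To sanity-check the `max_model_len=4096` / `max_num_seqs=16` settings for a 48GB card, a rough KV-cache budget can be computed. The layer count, KV-head count, and head dimension below are illustrative placeholder assumptions, not this model's published configuration; substitute the real values from the model's `config.json`.

```python
# Rough worst-case KV-cache budget for the vLLM settings above.
# NOTE: num_layers, num_kv_heads, and head_dim are ASSUMED values for
# illustration only; read the real ones from the model's config.json.
num_layers = 36      # assumed
num_kv_heads = 8     # assumed (grouped-query attention)
head_dim = 64        # assumed
bytes_per_elem = 2   # bf16 KV cache

max_model_len = 4096
max_num_seqs = 16

# K and V tensors per token, summed across all layers
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Worst case: every sequence slot filled to the full context length
kv_bytes = bytes_per_token * max_model_len * max_num_seqs
print(f"{kv_bytes / 2**30:.1f} GiB")  # 4.5 GiB under these assumptions
```

Under these assumed dimensions the cache stays well within the headroom left on a 48GB L40S after the ~30GB weights, which is why capping `max_num_seqs` matters more than raising `gpu_memory_utilization`.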