---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B active-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.

## 🧠 Model Details
* **Developer:** SCALAI
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B active parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish

## 📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Notably, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

### MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original 120B | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25%** |
| Math | 53.29% | 47.52% | 🟡 -5.77% |
| Law | 42.69% | 14.90% | 🔴 -27.79% (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

### GSM8K (Math Reasoning)
* **Pre-Healing (Pruned):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(zero math data leakage)*

## 🎯 Intended Use & Limitations
* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally pruned away to halve the VRAM footprint.

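For the JSON/schema-extraction use case, a common pattern is to pin the target schema in the system prompt and send the request to a deployed endpoint. A minimal sketch of building such a request body (the schema, invoice text, and field names below are illustrative, not part of the model's API):

```python
import json

# Request body in the shape expected by an OpenAI-compatible
# /v1/chat/completions endpoint (e.g., one served by vLLM).
# The served model name is deployment-specific.
payload = {
    "model": "SCALAI/ScaLite-60B-Coder",
    "messages": [
        {
            "role": "system",
            "content": (
                "Extract the requested fields and reply with strict JSON only, "
                'matching: {"total": number, "currency": string, "due_date": string}'
            ),
        },
        {
            "role": "user",
            "content": "Invoice 1234: total 59.90 EUR, due 2024-07-01.",
        },
    ],
    "temperature": 0.0,  # deterministic output for extraction tasks
    "max_tokens": 256,
}

# Serialize for an HTTP POST to the endpoint
body = json.dumps(payload)
print(body)
```

Setting `temperature` to 0.0 keeps the extraction deterministic; the model's reply can then be validated with `json.loads` before downstream use.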
## 💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces an extensive *Chain-of-Thought*, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,  # Disables CUDA graphs to save memory
    max_num_seqs=16,     # Limits pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

# Spanish prompt: "Find all pairs of integers (x, y) satisfying
# x^3 + y^3 = (x+y)^2. Think step by step and justify each deduction."
messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

# Render the chat template into a single prompt string
prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
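
For production serving, the same memory-conscious settings carry over to vLLM's OpenAI-compatible server. A launch sketch (port and served model path are deployment-specific):

```shell
# Launch an OpenAI-compatible endpoint with the same OOM-conscious settings
vllm serve SCALAI/ScaLite-60B-Coder \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-num-seqs 16 \
  --trust-remote-code \
  --port 8000
```

Clients can then POST chat requests to `http://<host>:8000/v1/chat/completions` using any OpenAI-compatible SDK.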