---
license: apache-2.0
language:
- en
- es
tags:
- moe
- code
- math
- vllm
- mxfp4
- pruning
- scalai
datasets:
- HuggingFaceH4/CodeAlpaca_20K
base_model: openai/gpt-oss-120b
---

# 🚀 ScaLite-60B-Coder by SCALAI

**ScaLite-60B-Coder** is a highly specialized, bilingual (English/Spanish) Mixture-of-Experts (MoE) large language model optimized for strict logical reasoning, Python programming, and mathematics.

Engineered by [SCALAI](https://scalai.es), this model was surgically distilled from OpenAI's 117B-parameter MoE (`gpt-oss-120b`) down to a 60B active-parameter footprint. Quantized in MXFP4, **ScaLite-60B-Coder requires only ~30GB of VRAM, making it fully deployable on a single NVIDIA L40S (48GB) GPU** with ample room for large KV caches in production environments.

## 🧠 Model Details
* **Developer:** SCALAI
* **Model Type:** Pruned Mixture-of-Experts (MoE) Causal Language Model
* **Base Model:** `openai/gpt-oss-120b` (128 experts)
* **Pruned Architecture:** 60B active parameters (64 experts)
* **Quantization:** MXFP4
* **Context Length:** 8,192 tokens (optimized for complex Chain-of-Thought)
* **Languages:** English, Spanish

## 📊 Benchmark Performance
ScaLite-60B-Coder sheds general trivia to hyper-focus on logic and computation. Notably, removing the interference of general-knowledge experts allowed this pruned 60B model to **outperform its 120B parent in Computer Science**.

### MMLU-Pro (5-shot) - The Domain Specialist Effect
| Category | Original 120B | **ScaLite-60B-Coder** | Delta |
| :--- | :--- | :--- | :--- |
| **Computer Science** | 39.02% | **49.27%** | 🟢 **+10.25%** |
| Math | 53.29% | 47.52% | 🟡 -5.77% |
| Law | 42.69% | 14.90% | 🔴 -27.79% (Pruned) |
| History | 40.68% | 17.06% | 🔴 -23.62% (Pruned) |

### GSM8K (Math Reasoning)
* **Pre-Healing (Pruned):** 17.59%
* **Post-Healing (Cross-Domain Code):** **61.03%** *(zero math data leakage)*

## 🎯 Intended Use & Limitations
* **Use Cases:** Asynchronous Python coding, JSON/schema extraction, bilingual logical deduction, and structured step-by-step math problem solving.
* **Limitations:** **Do not use this model for encyclopedic trivia, historical facts, or legal advice.** The experts responsible for memorizing general internet knowledge were intentionally pruned away to halve the VRAM footprint.

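For the JSON/schema-extraction use case, a common pattern is to pin the target schema in the system prompt and send the request to a deployed endpoint. A minimal sketch of building such a request body (the schema, invoice text, and field names below are illustrative, not part of the model's API):

```python
import json

# Request body in the shape expected by an OpenAI-compatible
# /v1/chat/completions endpoint (e.g., one served by vLLM).
# The served model name is deployment-specific.
payload = {
    "model": "SCALAI/ScaLite-60B-Coder",
    "messages": [
        {
            "role": "system",
            "content": (
                "Extract the requested fields and reply with strict JSON only, "
                'matching: {"total": number, "currency": string, "due_date": string}'
            ),
        },
        {
            "role": "user",
            "content": "Invoice 1234: total 59.90 EUR, due 2024-07-01.",
        },
    ],
    "temperature": 0.0,  # deterministic output for extraction tasks
    "max_tokens": 256,
}

# Serialize for an HTTP POST to the endpoint
body = json.dumps(payload)
print(body)
```

Setting `temperature` to 0.0 keeps the extraction deterministic; the model's reply can then be validated with `json.loads` before downstream use.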
## 💻 How to Use (vLLM on a single L40S)
ScaLite-60B-Coder is optimized for the `vLLM` engine. Because the model produces an extensive *Chain-of-Thought*, we recommend specific settings to prevent Out-Of-Memory (OOM) errors on 48GB cards.

```python
from vllm import LLM, SamplingParams

model_id = "SCALAI/ScaLite-60B-Coder"

# Load the model optimized for a single 48GB GPU (e.g., L40S)
llm = LLM(
    model=model_id,
    tensor_parallel_size=1,
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.90,
    enforce_eager=True,  # Disables CUDA graphs to save memory
    max_num_seqs=16,     # Limits pre-allocated KV cache
    trust_remote_code=True,
)

# Allow enough max_tokens for complex reasoning and CoT
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=3500,
)

# Spanish prompt: "Find all pairs of integers (x, y) satisfying
# x^3 + y^3 = (x+y)^2. Think step by step and justify each deduction."
messages = [
    {"role": "system", "content": "You are a logical, bilingual AI assistant. Think step-by-step."},
    {"role": "user", "content": "Encuentra todos los pares de números enteros (x, y) que satisfacen la siguiente ecuación: x^3 + y^3 = (x+y)^2. Piensa paso a paso y justifica cada deducción lógica."},
]

# Render the chat template into a single prompt string
prompt = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
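
For production serving, the same memory-conscious settings carry over to vLLM's OpenAI-compatible server. A launch sketch (port and served model path are deployment-specific):

```shell
# Launch an OpenAI-compatible endpoint with the same OOM-conscious settings
vllm serve SCALAI/ScaLite-60B-Coder \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-num-seqs 16 \
  --trust-remote-code \
  --port 8000
```

Clients can then POST chat requests to `http://<host>:8000/v1/chat/completions` using any OpenAI-compatible SDK.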