Smol AI WorldCup: A 5-Axis Benchmark That Reveals What Small Language Models Can Really Do
A 4B model outperforms 8B. A 1.5GB MoE achieves Champions-level quality. A 1.7B beats three 7–14B models. These aren't hypothetical claims: they're measured results from 18 models, 12 makers, 125 questions, and 7 languages.
We introduce Smol AI WorldCup, the first benchmark designed specifically for the deployment realities of small language models. Instead of asking only "how smart is it?", we ask five questions simultaneously: How smart? How honest? How fast? How small? How efficient?
The result is SHIFT โ a 5-axis evaluation framework โ and WCS (WorldCup Score) โ a composite metric that rewards models achieving both high quality and high efficiency.
| Live Leaderboard | huggingface.co/spaces/ginigen-ai/smol-worldcup |
| Dataset (125Q, Apache 2.0) | huggingface.co/datasets/ginigen-ai/smol-worldcup |
| ALL Bench Leaderboard | huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard |
The Problem: Existing Benchmarks Weren't Built for Edge AI
MMLU, GPQA, and HumanEval apply the same test to a 0.5B model and a 500B model. This approach has three fundamental blind spots when it comes to small model deployment:
1. They measure only intelligence. An engineer deploying AI on a smartphone needs to know not just "MMLU 70" but whether the model fits in 2GB, how often it fabricates information, and how many tokens it generates per second.
2. They miss small-model-specific failure modes. Our measurements show that a 1.3B model fabricates confident fake content about nonexistent people, papers, and products 80% of the time when prompted. This risk is invisible to traditional benchmarks.
3. They don't reward efficiency. A 14B model scoring higher than a 4B model is expected. But the fact that "4B achieves 95% of 14B's quality at 36% of the RAM" is lost in a single-column leaderboard. In the edge AI era, performance per unit of resource is what matters.
SHIFT: A 5-Axis Framework for What Actually Matters
SHIFT decomposes the deployment value of a small language model into five measurable axes:
| Axis | Meaning | How Measured | Questions |
|---|---|---|---|
| S – Size | Model footprint | Parameter count, active params (MoE) | Metadata |
| H – Honesty | Hallucination resistance, calibration, refusal balance | 40 questions, fully auto-graded | 40 |
| I – Intelligence | Reasoning, math, coding, 7 languages, metacognition | 85 questions, auto + LLM judge | 85 |
| F – Fast | Inference throughput | tok/s measured via HF Inference API | 15 samples |
| T – Thrift | Resource consumption | Peak VRAM/RAM at Q4 quantization | Metadata |
Honesty (H) – 40 Questions
Small models fail differently than large models. The most dangerous failure is confident fabrication. We test four dimensions:
| Category | Questions | Method | What It Catches |
|---|---|---|---|
| H1 – Hallucination Trap | 10 | json_field_check | Presents fake entities; model must refuse to fabricate |
| H2 – Confidence Calibration | 10 | calibration_check | Stated confidence must match actual accuracy |
| H3 – Refusal Balance | 10 | refusal_check | Penalizes both over-refusal and under-refusal |
| H4 – Self-Correction | 10 | self_correction_check | Model must catch and fix its own reasoning errors |
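To make the H2 idea concrete, here is a minimal Python sketch of what a confidence-calibration check could look like. The function name, the `(confidence, correctness)` input shape, and the 0–100 scoring rule are illustrative assumptions, not the benchmark's actual grader.

```python
def calibration_score(items):
    """H2-style calibration sketch: score is high when the model's stated
    confidence matches its observed accuracy. (Illustrative scoring rule,
    not the benchmark's actual grader.)

    items: list of (stated_confidence_pct, was_correct) pairs.
    """
    if not items:
        return 0.0
    mean_conf = sum(conf for conf, _ in items) / len(items)
    accuracy = 100.0 * sum(1 for _, ok in items if ok) / len(items)
    # Perfect calibration: stated confidence equals measured accuracy.
    return max(0.0, 100.0 - abs(mean_conf - accuracy))
```

A model that says "80% confident" on every item but is right only 60% of the time would score 80 under this rule, while a perfectly calibrated model scores 100.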
Intelligence (I) – 85 Questions
| Category | Questions | Method | Coverage |
|---|---|---|---|
| I1 – Reasoning | 15 | answer_match | Syllogisms, puzzles, pattern recognition |
| I2 – Math | 10 | numeric_match | Arithmetic through compound interest |
| I3 – Coding | 10 | code_execution | Python functions with executable test cases |
| I4 – Multilingual | 35 | llm_judge | 7 languages including KO, AR, PT, TR, BN, TH; 2.7B+ speakers |
| I5 – Knowledge Synthesis | 10 | llm_judge | Constrained explanations, critical thinking |
| I6 – Metacognition | 5 | llm_judge | Self-awareness, knowledge boundaries |
All 125 questions require mandatory JSON output with verifiable fields. Grading uses structured field validation โ not keyword matching. 75 of 125 questions are fully automatic with zero human intervention.
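A structured-field validator of this kind can be sketched in a few lines of Python. Everything here is a hypothetical illustration of the `json_field_check` idea (the signature, the brace-matching regex, and the field names in the usage example are assumptions, not the benchmark's implementation):

```python
import json
import re

def json_field_check(raw_output, required_fields, expected=None):
    """Sketch of structured-field grading: extract the first JSON object
    from a model's raw output, require the mandatory fields, and
    optionally compare specific fields against expected values."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if not match:
        return False  # no JSON object at all -> automatic fail
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return False  # malformed JSON -> automatic fail
    if not all(field in obj for field in required_fields):
        return False  # missing a mandatory field
    if expected:
        return all(obj.get(k) == v for k, v in expected.items())
    return True
```

For a hallucination-trap item, a grader along these lines could require an `exists` field and pass only models that set it to `false` for a fabricated entity, which is why keyword matching is unnecessary.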
WCS: The Official Ranking Metric
Why Not Just Use SHIFT or PIR Alone?
A small-model benchmark must measure two things simultaneously:
- Quality: "How well does it perform?" → SHIFT score
- Efficiency: "How much resource does it consume?" → PIR score
Using SHIFT alone, a 14B model naturally ranks above a 1.7B model, which tells us nothing new. Using PIR alone, a 1.3B model with mediocre quality becomes #1, which is misleading. Both must be high for a ranking to be meaningful.
The Formula
WCS = √(SHIFT × PIR_norm)
Where:
SHIFT = H × 0.4 + I × 0.6 (quality: honesty 40%, intelligence 60%)
PIR = (I × H × F) ÷ (S × T) (efficiency: output per unit resource)
PIR_norm = log₁₀(PIR) / log₁₀(max) × 100 (normalized to 0–100)
Why geometric mean? Arithmetic mean (A+B)/2 allows one strong axis to compensate for a weak one. Geometric mean √(A×B) requires both to be strong: a model that's smart but massive, or tiny but poor, ranks low under WCS.
Why log-normalize PIR? Raw PIR ranges from 18 to 6,952, a 386× spread. Log₁₀ compression maps this to 0–100, making it commensurable with SHIFT on the same scale.
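Putting the three formulas together gives a short Python sketch. The units of S and T (and therefore the raw PIR magnitudes) are assumptions here, since the post does not spell them out; 6,952 is used as the normalization ceiling because it is Season 1's maximum observed PIR.

```python
import math

def wcs(h, i, f, s, t, pir_max=6952.0):
    """WorldCup Score sketch: geometric mean of SHIFT quality and
    log-normalized PIR efficiency. h and i are 0-100 axis scores,
    f is tok/s, s and t are the size/thrift terms (units assumed)."""
    shift = 0.4 * h + 0.6 * i              # honesty 40%, intelligence 60%
    pir = (i * h * f) / (s * t)            # output per unit resource
    pir_norm = 100.0 * math.log10(pir) / math.log10(pir_max)
    return math.sqrt(shift * pir_norm)     # both factors must be high
```

Because of the geometric mean, shrinking the resource terms at equal quality raises WCS, and no amount of raw quality can rescue a model whose efficiency term collapses.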
Football League Tiers
Models are classified by actual runtime RAM (Q4 quantized), not raw parameter count. This correctly handles MoE models where total params ≠ active params.
| League | RAM | Target Hardware | Season 1 Champion |
|---|---|---|---|
| League One | < 2GB | Raspberry Pi, IoT | GPT-OSS-20B (WCS 82.6) |
| La Liga | 2–4GB | Smartphone | Gemma-3n-E4B (WCS 81.8) |
| Premier League | 4–8GB | Laptop (8GB) | Qwen3-8B (WCS 72.8) |
| Champions League | 8–16GB | Desktop PC | Llama-4-Scout (WCS 79.3) |
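The tier boundaries above translate directly into a classifier. The function name and the out-of-range behavior are illustrative; the boundaries themselves come from the tier table.

```python
def assign_league(runtime_ram_gb):
    """Map a model's Q4 runtime RAM (GB) to its league tier.
    Boundaries taken from the Season 1 tier table; the name and
    out-of-range behavior are illustrative."""
    if runtime_ram_gb < 2:
        return "League One"        # Raspberry Pi, IoT
    if runtime_ram_gb < 4:
        return "La Liga"           # smartphone
    if runtime_ram_gb < 8:
        return "Premier League"    # 8GB laptop
    if runtime_ram_gb < 16:
        return "Champions League"  # desktop PC
    return None  # outside the small-model scope
```

This is why GPT-OSS-20B, despite its 21B total parameters, lands in League One: the input is its 1.5GB runtime footprint, not its parameter count.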
Season 1 Results: 18 Models, 12 Makers
All scores measured via the HuggingFace Inference API across 8 providers. Speed measured with 5 prompts × 3 rounds per model. Evaluated March 2026.
| Rank | Model | Maker | WCS | SHIFT | PIR | tok/s | League |
|---|---|---|---|---|---|---|---|
| 1 | GPT-OSS-20B | OpenAI | 82.6 | 76.9 | 2,586 | 71.9 | League One |
| 2 | Gemma-3n-E4B | Google | 81.8 | 77.3 | 2,136 | 43.8 | La Liga |
| 3 | Llama-4-Scout | Meta | 79.3 | 74.2 | 1,804 | 240.5 | Champions League |
| 4 | Qwen3-4B | Alibaba | 76.6 | 76.8 | 858 | 50.0 | La Liga |
| 5 | Qwen3-1.7B | Alibaba | 76.1 | 66.8 | 2,148 | 30.1 | League One |
| 6 | GLM-4.7-Flash 🧠 | Zhipu AI | 73.2 | 74.8 | 566 | 50.8 | La Liga |
| 7 | Qwen3.5-35B-A3B | Alibaba | 72.9 | 75.3 | 517 | 108.7 | Champions League |
| 8 | Qwen3-8B | Alibaba | 72.8 | 76.9 | 445 | 186.8 | Premier League |
| 9 | Llama-3.2-1B | Meta | 70.5 | 49.7 | 6,952 | 113.2 | League One |
| 10 | Tiny-Aya-Fire | Cohere | 69.7 | 58.9 | 1,488 | 111.6 | La Liga |
| 11 | Qwen3.5-9B | Alibaba | 67.3 | 71.1 | 280 | 130.6 | Premier League |
| 12 | OLMo-3-7B | AllenAI | 65.5 | 70.2 | 224 | 50.0 | Premier League |
| 13 | DeepSeek-R1-7B 🧠 | DeepSeek | 65.4 | 68.2 | 257 | 69.2 | Premier League |
| 14 | Llama-3.1-8B | Meta | 62.4 | 61.0 | 282 | 187.7 | Premier League |
| 15 | Nemotron-Nano-8B 🧠 | NVIDIA | 58.4 | 65.9 | 98 | 29.8 | Premier League |
| 16 | Gemma-3-12B | Google | 55.0 | 75.7 | 34 | 18.7 | Champions League |
| 17 | Mistral-7B-v0.2 | Mistral | 53.0 | 60.6 | 60 | 17.8 | Premier League |
| 18 | DeepSeek-R1-14B 🧠 | DeepSeek | 44.2 | 59.8 | 18 | 21.4 | Champions League |
🧠 = Thinking model (uses internal `<think>` reasoning tokens)
Key Findings
1. 4B Outperforms 8B
Gemma-3n-E4B (4B parameters, 2.0GB RAM) achieves SHIFT 77.3, the highest quality score among all 18 models. Qwen3-8B (8B, 5.5GB) scores 76.9. The gap is 0.4 points. The RAM difference is 2.75×.
This is not an anomaly. Qwen3-4B (4B, 2.8GB) also scores 76.8, essentially matching the 8B model at half the resource cost. The implication is concrete: smartphone-class hardware (2–3GB) can now run laptop-class AI (5.5GB) with negligible quality loss.
2. The MoE Efficiency Revolution
GPT-OSS-20B is a Mixture-of-Experts model with 21B total parameters but only 3.6B active at inference. It fits in 1.5GB of RAM, classifying it as League One (Raspberry Pi tier), yet achieves SHIFT 76.9, matching Champions League dense models that require 8–9GB.
At comparable SHIFT scores (~76), the MoE model uses 5.7× less RAM than the equivalent dense model (Gemma-3-12B at 8.5GB). This validates MoE as the most promising architecture for edge deployment at scale.
3. Thinking Models: A Double-Edged Sword
Models that use internal reasoning tokens (`<think>...</think>`) face two penalties in SHIFT evaluation:
Quality penalty: The `<think>` tags interfere with structured JSON output, which is mandatory for all 125 questions. At comparable sizes:
- Qwen3-8B (non-thinking) → SHIFT 76.9
- DeepSeek-R1-7B (thinking) → SHIFT 68.2 (−8.7 points)
Speed penalty: Internal reasoning generates 2–6× more tokens than visible output:
- Qwen3-8B → 186.8 tok/s
- DeepSeek-R1-7B → 69.2 tok/s (2.7× slower)
- Nemotron-Nano-8B → 29.8 tok/s (6.3× slower)
This does not mean thinking models are inferior: DeepSeek-R1-14B's reasoning score (I1 = 96.7) is the highest of any model tested. But for structured output tasks and real-time edge deployment, non-thinking models deliver substantially better practical value.
4. Hallucination Trap: The Most Discriminating Metric
H1 (Hallucination Trap) produces the widest score spread of any single metric: 80 points (from 20 to 100).
The test presents nonexistent people, academic papers, laws, and products, then asks the model to describe them. Models that correctly refuse score high; models that confidently fabricate content score low.
H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash, Qwen3.5-35B-A3B
H1 = 90: Gemma-3n-E4B, Llama-4-Scout, Qwen3.5-9B
H1 = 80: DeepSeek-R1-7B, Gemma-3-12B
H1 = 70: Llama-3.1-8B, Nemotron-Nano-8B
H1 = 60: Qwen3-1.7B, DeepSeek-R1-14B, OLMo-3-7B
H1 = 40: Mistral-7B-v0.2
H1 = 30: Tiny-Aya-Fire
H1 = 20: Llama-3.2-1B
Two notable patterns emerge. First, the Qwen3 family achieves strong hallucination resistance across all sizes (60–100 from 1.7B to 35B), suggesting that training pipeline design is more determinative than scale. Second, a 1.3B model fabricates fake content 80% of the time, a critical safety consideration for IoT and embedded deployment.
5. The 1.7B Rebellion: Architecture Generation Beats Scale
Qwen3-1.7B (1.7B parameters, 1.2GB RAM, released 2025) outscores three significantly larger models:
- Mistral-7B-v0.2 (7.2B, 5.0GB, 2024) → SHIFT 60.6 (−6.2 points)
- Llama-3.1-8B (8.0B, 5.5GB, 2024) → SHIFT 61.0 (−5.8 points)
- DeepSeek-R1-14B (14.8B, 9.5GB, 2025) → SHIFT 59.8 (−7.0 points)
A 1.7B model from the latest generation outperforms models 4–9× its size from the previous generation. In the small-model domain, architecture vintage matters more than parameter count.
Union Eval: Direct Comparison with Frontier SOTA
Beyond the 125 SHIFT questions, each model takes a separate Union Eval consisting of 19 cross-benchmark questions. The same questions are administered to frontier-class models, enabling direct comparison:
| Rank | Frontier Model | Union Score |
|---|---|---|
| 1 | Claude Sonnet 4.6 | 69.9 |
| 2 | Claude Opus 4.6 | 69.3 |
| 3 | GPT-5.4 | 62.4 |
| 4 | DeepSeek V3.2 | 60.3 |
| 5 | Qwen3.5-397B | 57.1 |
Best small model: Gemma-3-12B = 57.1 (82% of Claude Sonnet)
A 12B open-source model matches the score of a 397B model on identical questions. This places the current frontier of small-model capability at approximately 80% of the best proprietary systems โ a gap that is closing rapidly.
Research Collaboration & Benchmark Ecosystem
Smol AI WorldCup was developed in collaboration with the FINAL Bench research team. The cross-benchmark question design for Union Eval and the SOTA comparison framework draw on FINAL Bench's evaluation methodology.
The benchmark also integrates with the ALL Bench Leaderboard, creating a complementary evaluation ecosystem:
- Smol AI WorldCup answers: "Where does this 4B model rank among small models?"
- ALL Bench Leaderboard answers: "Where does that 4B model rank against GPT-5 and Claude?"
This cross-benchmark interoperability provides model developers and deployers with a unified view across the full spectrum, from edge-class small models to frontier-class large models, within a single evaluation ecosystem. Rather than operating in isolation, both benchmarks share evaluation principles and directional data to support the broader open-source AI community.
Speed Measurement Methodology
All 18 models were measured under identical conditions via the HuggingFace Inference API:
- Warmup call – first inference excluded (cold-start bias removal)
- 5 diverse prompts – explanation, Python coding, Korean translation, JSON generation, arithmetic
- 3 rounds – 15 total samples per model
- Metric: tok/s = completion_tokens / elapsed_seconds
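The aggregation step above can be sketched as a small helper. The name, signature, and sample format are illustrative assumptions, not the benchmark's harness code.

```python
def tokens_per_second(samples, warmup=1):
    """Aggregate throughput from (completion_tokens, elapsed_seconds)
    measurements, dropping the first `warmup` calls to remove
    cold-start bias. Illustrative helper, not the benchmark's code."""
    usable = [(tok, sec) for tok, sec in samples[warmup:] if sec > 0]
    if not usable:
        raise ValueError("no usable samples after warmup")
    rates = [tok / sec for tok, sec in usable]
    return sum(rates) / len(rates)  # mean per-call tok/s
```

Dropping the warmup call matters: a slow cold-start first inference would otherwise drag the mean down for every provider.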
| Rank | Model | tok/s | Provider |
|---|---|---|---|
| 1 | Llama-4-Scout | 240.5 | Groq |
| 2 | Llama-3.1-8B | 187.7 | Cerebras |
| 3 | Qwen3-8B | 186.8 | Fireworks |
| ... | ... | ... | ... |
| 17 | Mistral-7B-v0.2 | 17.8 | Featherless |
A key observation: provider infrastructure affects speed more than model size. The Groq inference chip runs Llama-4-Scout (17B MoE) at 240 tok/s, while the same-tier dense model on Featherless achieves only 18 tok/s, a 13× gap driven entirely by hardware, not architecture. This real-world speed variance is captured in PIR and propagated into WCS.
Season System: Preventing Benchmark Contamination
| Season 1 (Current) | |
|---|---|
| Total questions | 125 |
| Anchor questions | 30 (fixed across seasons for IRT calibration) |
| Rotating questions | 95 (70%+ replaced each season) |
| Union Eval | 19 (undisclosed, rotated seasonally) |
| Period | 2026 Q1 |
| Next season | 2026 Q3 (planned) |
The 30 anchor questions remain identical across seasons for Item Response Theory (IRT) calibration, enabling cross-season comparability. The 95 rotating questions are replaced to prevent benchmark contamination โ a growing concern as models increasingly train on public evaluation data. Previous seasons' rotating questions are released publicly after each season concludes.
Evaluated Models (18)
| # | HuggingFace Repo ID | Display Name | Maker | Arch |
|---|---|---|---|---|
| 1 | meta-llama/Llama-3.2-1B-Instruct | Llama-3.2-1B | Meta | Dense |
| 2 | Qwen/Qwen3-1.7B | Qwen3-1.7B | Alibaba | Dense |
| 3 | openai/gpt-oss-20b | GPT-OSS-20B | OpenAI | MoE 🧠 |
| 4 | CohereLabs/tiny-aya-fire | Tiny-Aya-Fire | Cohere | Dense |
| 5 | Qwen/Qwen3-4B-Instruct-2507 | Qwen3-4B | Alibaba | Dense |
| 6 | google/gemma-3n-E4B-it | Gemma-3n-E4B | Google | PLE |
| 7 | zai-org/GLM-4.7-Flash | GLM-4.7-Flash | Zhipu AI | MoE 🧠 |
| 8 | mistralai/Mistral-7B-Instruct-v0.2 | Mistral-7B-v0.2 | Mistral | Dense |
| 9 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-7B | DeepSeek | Dense 🧠 |
| 10 | Qwen/Qwen3-8B | Qwen3-8B | Alibaba | Dense |
| 11 | meta-llama/Llama-3.1-8B-Instruct | Llama-3.1-8B | Meta | Dense |
| 12 | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | Nemotron-Nano-8B | NVIDIA | Dense 🧠 |
| 13 | Qwen/Qwen3.5-9B | Qwen3.5-9B | Alibaba | Dense |
| 14 | allenai/Olmo-3-7B-Instruct | OLMo-3-7B | AllenAI | Dense |
| 15 | google/gemma-3-12b-it | Gemma-3-12B | Google | Dense |
| 16 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-14B | DeepSeek | Dense 🧠 |
| 17 | Qwen/Qwen3.5-35B-A3B | Qwen3.5-35B-A3B | Alibaba | MoE |
| 18 | meta-llama/Llama-4-Scout-17B-16E-Instruct | Llama-4-Scout | Meta | MoE |
12 makers represented: Meta, Alibaba, Google, OpenAI, DeepSeek, Mistral, NVIDIA, Cohere, Zhipu AI, AllenAI, Nanbeige (in dataset, future season), and others.
Recommendations by Use Case
| Use Case | Recommended Model | WCS | Rationale |
|---|---|---|---|
| Overall best | GPT-OSS-20B | 82.6 | Highest WCS: top tier in both quality and efficiency |
| Highest quality | Gemma-3n-E4B | 81.8 | SHIFT 77.3 (#1 quality) in just 2GB of RAM |
| Fastest inference | Llama-4-Scout | 79.3 | 240.5 tok/s, 13× faster than comparable dense models |
| Smartphone deployment | Gemma-3n-E4B | 81.8 | Runs on 4GB phones with quality exceeding most laptop models |
| Most trustworthy | Qwen3-8B | 72.8 | Honesty score H = 87.9, highest of any model tested |
| Best value | GPT-OSS-20B | 82.6 | Champions-quality AI in 1.5GB, fits on a Raspberry Pi |
| Closest to SOTA | Gemma-3-12B | 55.0 | Union 57.1 = 82% of Claude Sonnet on identical questions |
| Best multilingual | Gemma-3n-E4B | 81.8 | I4 = 65.2, highest score across 7 languages |
Conclusion
The Smol AI WorldCup Season 1 evaluation of 18 models across 12 makers yields a clear conclusion:
Architecture and training quality โ not parameter count โ determine practical deployment value.
A 4B model outperforms 8B at 36% of the RAM. A 1.5GB MoE matches Champions League dense models. A 2025-generation 1.7B outscores three 2024-generation 7–14B models. These findings challenge the scaling assumption that "bigger is always better" with measured, reproducible evidence.
The SHIFT framework and WCS metric provide the first integrated scoring system that captures all five axes that matter for edge AI deployment. The benchmark's season system, with 70%+ question rotation and IRT-calibrated anchors, is designed for long-term durability against contamination.
We look forward to Season 2 (2026 Q3), which will expand the model roster, rotate the question set, and refine the evaluation methodology based on community feedback.
The dataset is publicly available under Apache 2.0. We welcome submissions of new models for evaluation.
Developed by Ginigen.ai in collaboration with the FINAL Bench research team.
Small but Mighty AI.