๐ŸŸ๏ธ Smol AI WorldCup: A 5-Axis Benchmark That Reveals What Small Language Models Can Really Do

Community Article · Published March 10, 2026

A 4B model outperforms 8B. A 1.5GB MoE achieves Champions-level quality. A 1.7B beats three 7–14B models. These aren't hypothetical claims; they're measured results from 18 models, 12 makers, 125 questions, and 7 languages.

We introduce Smol AI WorldCup, the first benchmark designed specifically for the deployment realities of small language models. Instead of asking only "how smart is it?", we ask five questions simultaneously: How smart? How honest? How fast? How small? How efficient?

The result is SHIFT, a 5-axis evaluation framework, and WCS (WorldCup Score), a composite metric that rewards models achieving both high quality and high efficiency.

๐ŸŸ๏ธ Live Leaderboard huggingface.co/spaces/ginigen-ai/smol-worldcup
๐Ÿ“Š Dataset (125Q, Apache 2.0) huggingface.co/datasets/ginigen-ai/smol-worldcup
๐Ÿ… ALL Bench Leaderboard huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard


The Problem: Existing Benchmarks Weren't Built for Edge AI

MMLU, GPQA, and HumanEval apply the same test to a 0.5B model and a 500B model. This approach has three fundamental blind spots when it comes to small model deployment:

1. They measure only intelligence. An engineer deploying AI on a smartphone needs to know not just "MMLU 70" but whether the model fits in 2GB, how often it fabricates information, and how many tokens it generates per second.

2. They miss small-model-specific failure modes. Our measurements show that a 1.3B model fabricates confident fake content about nonexistent people, papers, and products 80% of the time when prompted. This risk is invisible to traditional benchmarks.

3. They don't reward efficiency. A 14B model scoring higher than a 4B model is expected. But the fact that "4B achieves 95% of 14B's quality at 36% of the RAM" is lost in a single-column leaderboard. In the edge AI era, performance per unit of resource is what matters.


SHIFT: A 5-Axis Framework for What Actually Matters

SHIFT decomposes the deployment value of a small language model into five measurable axes:

| Axis | Meaning | How Measured | Questions |
|---|---|---|---|
| S – Size | Model footprint | Parameter count, active params (MoE) | Metadata |
| H – Honesty | Hallucination resistance, calibration, refusal balance | 40 questions, fully auto-graded | 40 |
| I – Intelligence | Reasoning, math, coding, 7 languages, metacognition | 85 questions, auto + LLM judge | 85 |
| F – Fast | Inference throughput | tok/s measured via HF Inference API | 15 samples |
| T – Thrift | Resource consumption | Peak VRAM/RAM at Q4 quantization | Metadata |

Honesty (H): 40 Questions

Small models fail differently than large models. The most dangerous failure is confident fabrication. We test four dimensions:

| Category | Questions | Method | What It Catches |
|---|---|---|---|
| H1 – Hallucination Trap | 10 | json_field_check | Presents fake entities; model must refuse to fabricate |
| H2 – Confidence Calibration | 10 | calibration_check | Stated confidence must match actual accuracy |
| H3 – Refusal Balance | 10 | refusal_check | Penalizes both over-refusal and under-refusal |
| H4 – Self-Correction | 10 | self_correction_check | Model must catch and fix its own reasoning errors |
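
The grading harness itself isn't published alongside the dataset, but the method names above suggest straightforward implementations. As one illustration, an H2-style calibration check might score the gap between stated confidence and actual accuracy; the scoring rule below is an assumption, not the official grader:

```python
def calibration_check(items: list[tuple[float, bool]]) -> float:
    """Hypothetical H2 grader: reward models whose stated confidence
    tracks their actual accuracy. `items` pairs a stated confidence
    in [0, 1] with whether the answer was actually correct."""
    gap = sum(abs(conf - float(correct)) for conf, correct in items) / len(items)
    return round((1.0 - gap) * 100, 1)  # 100 = perfectly calibrated

# "90% sure" and right, "20% sure" and wrong -> well calibrated (85.0).
# Confident wrong answers drag the score down hardest.
print(calibration_check([(0.9, True), (0.2, False)]))
```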

Intelligence (I): 85 Questions

| Category | Questions | Method | Coverage |
|---|---|---|---|
| I1 – Reasoning | 15 | answer_match | Syllogisms, puzzles, pattern recognition |
| I2 – Math | 10 | numeric_match | Arithmetic through compound interest |
| I3 – Coding | 10 | code_execution | Python functions with executable test cases |
| I4 – Multilingual | 35 | llm_judge | 7 languages (incl. KO, AR, PT, TR, BN, TH), 2.7B+ speakers |
| I5 – Knowledge Synthesis | 10 | llm_judge | Constrained explanations, critical thinking |
| I6 – Metacognition | 5 | llm_judge | Self-awareness, knowledge boundaries |

All 125 questions require mandatory JSON output with verifiable fields. Grading uses structured field validation, not keyword matching. 75 of the 125 questions are fully automatic, with zero human intervention.
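
As a sketch of what structured field validation might look like (the function name matches the H1 method in the table above; the schema and field names are illustrative assumptions):

```python
import json

def json_field_check(raw_output: str, expected_fields: dict) -> bool:
    """Parse the model's mandatory JSON output and verify each
    expected field exactly. Malformed JSON fails automatically,
    so there is no credit for keyword-matching a prose answer."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return all(data.get(k) == v for k, v in expected_fields.items())

# Hypothetical H1 trap item: the entity does not exist, so the
# correct structured behavior is to flag it rather than fabricate.
output = '{"entity_exists": false, "description": null}'
print(json_field_check(output, {"entity_exists": False}))  # True
```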



WCS: The Official Ranking Metric

Why Not Just Use SHIFT or PIR Alone?

A small-model benchmark must measure two things simultaneously:

  • Quality: "How well does it perform?" โ†’ SHIFT score
  • Efficiency: "How much resource does it consume?" โ†’ PIR score

Using SHIFT alone, a 14B model naturally ranks above a 1.7B, which tells us nothing new. Using PIR alone, a 1.3B model with mediocre quality becomes #1, which is misleading. Both must be high for a ranking to be meaningful.

The Formula

WCS = √( SHIFT × PIR_norm )

Where:
  SHIFT    = 0.4 × H + 0.6 × I                   (quality: honesty 40%, intelligence 60%)
  PIR      = (I × H × F) ÷ (S × T)               (efficiency: output per unit resource)
  PIR_norm = log₁₀(PIR) / log₁₀(PIR_max) × 100   (normalized to 0–100)

Why geometric mean? Arithmetic mean (A+B)/2 allows one strong axis to compensate for a weak one. Geometric mean √(A×B) requires both to be strong: a model that's smart but massive, or tiny but poor, ranks low under WCS.

Why log-normalize PIR? Raw PIR ranges from 18 to 6,952, a 386× spread. Log₁₀ compression maps this to 0–100, making it commensurable with SHIFT on the same scale.
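
The full computation is short enough to reproduce from the published scores. A minimal sketch, assuming PIR_max is the Season 1 maximum of 6,952 (Llama-3.2-1B):

```python
import math

def wcs(shift: float, pir: float, pir_max: float = 6952.0) -> float:
    """WorldCup Score: geometric mean of quality (SHIFT, 0-100) and
    log-normalized efficiency (PIR), both on the same 0-100 scale."""
    pir_norm = math.log10(pir) / math.log10(pir_max) * 100
    return math.sqrt(shift * pir_norm)

# Season 1 champion GPT-OSS-20B: SHIFT 76.9, raw PIR 2,586
print(round(wcs(76.9, 2586), 1))  # 82.6, matching the leaderboard
```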

Football League Tiers

Models are classified by actual runtime RAM (Q4 quantized), not raw parameter count. This correctly handles MoE models, where total params ≠ active params.

| League | RAM | Target Hardware | Season 1 Champion |
|---|---|---|---|
| 🥅 League One | < 2GB | Raspberry Pi, IoT | GPT-OSS-20B (WCS 82.6) |
| ⚽ La Liga | 2–4GB | Smartphone | Gemma-3n-E4B (WCS 81.8) |
| 🏅 Premier League | 4–8GB | Laptop (8GB) | Qwen3-8B (WCS 72.8) |
| 🏆 Champions League | 8–16GB | Desktop PC | Llama-4-Scout (WCS 79.3) |


Season 1 Results: 18 Models, 12 Makers

All scores were measured via the HuggingFace Inference API across 8 providers. Speed was measured with 5 prompts × 3 rounds per model. Evaluated March 2026.

| Rank | Model | Maker | WCS | SHIFT | PIR | ⚡ tok/s | League |
|---|---|---|---|---|---|---|---|
| 🏆 | GPT-OSS-20B | OpenAI | 82.6 | 76.9 | 2,586 | 71.9 | 🥅 |
| 🥈 | Gemma-3n-E4B | Google | 81.8 | 77.3 | 2,136 | 43.8 | ⚽ |
| 🥉 | Llama-4-Scout | Meta | 79.3 | 74.2 | 1,804 | 240.5 | 🏆 |
| 4 | Qwen3-4B | Alibaba | 76.6 | 76.8 | 858 | 50.0 | ⚽ |
| 5 | Qwen3-1.7B | Alibaba | 76.1 | 66.8 | 2,148 | 30.1 | 🥅 |
| 6 | GLM-4.7-Flash 🧠 | Zhipu AI | 73.2 | 74.8 | 566 | 50.8 | ⚽ |
| 7 | Qwen3.5-35B-A3B | Alibaba | 72.9 | 75.3 | 517 | 108.7 | 🏆 |
| 8 | Qwen3-8B | Alibaba | 72.8 | 76.9 | 445 | 186.8 | 🏅 |
| 9 | Llama-3.2-1B | Meta | 70.5 | 49.7 | 6,952 | 113.2 | 🥅 |
| 10 | Tiny-Aya-Fire | Cohere | 69.7 | 58.9 | 1,488 | 111.6 | ⚽ |
| 11 | Qwen3.5-9B | Alibaba | 67.3 | 71.1 | 280 | 130.6 | 🏅 |
| 12 | OLMo-3-7B | AllenAI | 65.5 | 70.2 | 224 | 50.0 | 🏅 |
| 13 | DeepSeek-R1-7B 🧠 | DeepSeek | 65.4 | 68.2 | 257 | 69.2 | 🏅 |
| 14 | Llama-3.1-8B | Meta | 62.4 | 61.0 | 282 | 187.7 | 🏅 |
| 15 | Nemotron-Nano-8B 🧠 | NVIDIA | 58.4 | 65.9 | 98 | 29.8 | 🏅 |
| 16 | Gemma-3-12B | Google | 55.0 | 75.7 | 34 | 18.7 | 🏆 |
| 17 | Mistral-7B-v0.2 | Mistral | 53.0 | 60.6 | 60 | 17.8 | 🏅 |
| 18 | DeepSeek-R1-14B 🧠 | DeepSeek | 44.2 | 59.8 | 18 | 21.4 | 🏆 |

🧠 = Thinking model (uses internal `<think>` reasoning tokens)



Key Findings

1. 4B Outperforms 8B

Gemma-3n-E4B (4B parameters, 2.0GB RAM) achieves SHIFT 77.3, the highest quality score among all 18 models. Qwen3-8B (8B, 5.5GB) scores 76.9. The gap is 0.4 points; the RAM difference is 2.75×.

This is not an anomaly. Qwen3-4B (4B, 2.8GB) also scores 76.8, essentially matching the 8B model at half the resource cost. The implication is concrete: smartphone-class hardware (2–3GB) can now run laptop-class AI (5.5GB) with negligible quality loss.

2. The MoE Efficiency Revolution

GPT-OSS-20B is a Mixture-of-Experts model with 21B total parameters but only 3.6B active at inference. It fits in 1.5GB of RAM, placing it in League One (Raspberry Pi tier), yet achieves SHIFT 76.9, matching Champions League dense models that require 8–9GB.

At comparable SHIFT scores (~76), the MoE model uses 5.7× less RAM than an equivalent dense model (Gemma-3-12B at 8.5GB). This validates MoE as the most promising architecture for edge deployment at scale.

3. Thinking Models: A Double-Edged Sword

Models that use internal reasoning tokens (`<think>...</think>`) face two penalties in SHIFT evaluation:

Quality penalty: `<think>` tags interfere with the structured JSON output that all 125 questions require (a deployment-side mitigation is sketched below). At comparable sizes:

  • Qwen3-8B (non-thinking) โ†’ SHIFT 76.9
  • DeepSeek-R1-7B (thinking) โ†’ SHIFT 68.2 (โˆ’8.7 points)

Speed penalty: Internal reasoning generates 2–6× more tokens than the visible output:

  • Qwen3-8B โ†’ 186.8 tok/s
  • DeepSeek-R1-7B โ†’ 69.2 tok/s (2.7ร— slower)
  • Nemotron-Nano-8B โ†’ 29.8 tok/s (6.3ร— slower)

This does not mean thinking models are inferior: DeepSeek-R1-14B's reasoning score (I1 = 96.7) is the highest of any model tested. But for structured output tasks and real-time edge deployment, non-thinking models deliver substantially better practical value.

4. Hallucination Trap: The Most Discriminating Metric

H1 (Hallucination Trap) produces the widest score spread of any single metric: 80 points (from 20 to 100).

The test presents nonexistent people, academic papers, laws, and products, then asks the model to describe them. Models that correctly refuse score high; models that confidently fabricate content score low.

H1 = 100:  Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash, Qwen3.5-35B-A3B
H1 = 90:   Gemma-3n-E4B, Llama-4-Scout, Qwen3.5-9B
H1 = 80:   DeepSeek-R1-7B, Gemma-3-12B
H1 = 70:   Llama-3.1-8B, Nemotron-Nano-8B
H1 = 60:   Qwen3-1.7B, DeepSeek-R1-14B, OLMo-3-7B
H1 = 40:   Mistral-7B-v0.2
H1 = 30:   Tiny-Aya-Fire
H1 = 20:   Llama-3.2-1B

Two notable patterns emerge. First, the Qwen3 family achieves strong hallucination resistance across all sizes (60–100 from 1.7B to 35B), suggesting that training pipeline design is more determinative than scale. Second, a 1.3B model fabricates fake content 80% of the time, a critical safety consideration for IoT and embedded deployment.

5. The 1.7B Rebellion: Architecture Generation Beats Scale

Qwen3-1.7B (1.7B parameters, 1.2GB RAM, released 2025) outscores three significantly larger models:

  • Mistral-7B-v0.2 (7.2B, 5.0GB, 2024) โ†’ SHIFT 60.6 (โˆ’6.2 points)
  • Llama-3.1-8B (8.0B, 5.5GB, 2024) โ†’ SHIFT 61.0 (โˆ’5.8 points)
  • DeepSeek-R1-14B (14.8B, 9.5GB, 2025) โ†’ SHIFT 59.8 (โˆ’7.0 points)

A 1.7B model from the latest generation outperforms models 4–9× its size from the previous generation. In the small-model domain, architecture vintage matters more than parameter count.


Union Eval: Direct Comparison with Frontier SOTA

Beyond the 125 SHIFT questions, each model takes a separate Union Eval consisting of 19 cross-benchmark questions. The same questions are administered to frontier-class models, enabling direct comparison:

| Rank | Frontier Model | Union Score |
|---|---|---|
| 1 | Claude Sonnet 4.6 | 69.9 |
| 2 | Claude Opus 4.6 | 69.3 |
| 3 | GPT-5.4 | 62.4 |
| 4 | DeepSeek V3.2 | 60.3 |
| 5 | Qwen3.5-397B | 57.1 |

Best small model: Gemma-3-12B = 57.1 (82% of Claude Sonnet)

A 12B open-source model matches the score of a 397B model on identical questions. This places the current frontier of small-model capability at approximately 80% of the best proprietary systems, a gap that is closing rapidly.


Research Collaboration & Benchmark Ecosystem

Smol AI WorldCup was developed in collaboration with the FINAL Bench research team. The cross-benchmark question design for Union Eval and the SOTA comparison framework draw on FINAL Bench's evaluation methodology.

The benchmark also integrates with the ALL Bench Leaderboard, creating a complementary evaluation ecosystem:

  • Smol AI WorldCup answers: "Where does this 4B model rank among small models?"
  • ALL Bench Leaderboard answers: "Where does that 4B model rank against GPT-5 and Claude?"

This cross-benchmark interoperability provides model developers and deployers with a unified view across the full spectrum, from edge-class small models to frontier-class large models, within a single evaluation ecosystem. Rather than operating in isolation, both benchmarks share evaluation principles and directional data to support the broader open-source AI community.


Speed Measurement Methodology

All 18 models were measured under identical conditions via the HuggingFace Inference API (a sketch of the procedure follows the list):

  1. Warmup call: the first inference is excluded to remove cold-start bias
  2. 5 diverse prompts: explanation, Python coding, Korean translation, JSON generation, arithmetic
  3. 3 rounds: 15 total samples per model
  4. Metric: tok/s = completion_tokens / elapsed_seconds
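
A minimal sketch of this procedure, assuming huggingface_hub's InferenceClient (the prompts and token cap here are illustrative stand-ins, not the exact benchmark inputs):

```python
import time
from huggingface_hub import InferenceClient  # pip install huggingface_hub

# Illustrative stand-ins for the five prompt categories
PROMPTS = [
    "Explain how a hash table works.",          # explanation
    "Write a Python function for FizzBuzz.",    # coding
    "Translate 'good morning' into Korean.",    # Korean translation
    'Return {"status": "ok"} as JSON.',         # JSON generation
    "What is 17 * 23?",                         # arithmetic
]

def measure_tok_s(model_id: str, rounds: int = 3) -> float:
    client = InferenceClient(model=model_id)
    # Warmup call, excluded from the measurement (cold-start bias)
    client.chat_completion([{"role": "user", "content": "hi"}], max_tokens=8)
    samples = []
    for _ in range(rounds):
        for prompt in PROMPTS:
            t0 = time.perf_counter()
            out = client.chat_completion(
                [{"role": "user", "content": prompt}], max_tokens=256
            )
            elapsed = time.perf_counter() - t0
            samples.append(out.usage.completion_tokens / elapsed)
    return sum(samples) / len(samples)  # mean over 15 samples
```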
| Rank | Model | tok/s | Provider |
|---|---|---|---|
| 1 | Llama-4-Scout | 240.5 | Groq |
| 2 | Llama-3.1-8B | 187.7 | Cerebras |
| 3 | Qwen3-8B | 186.8 | Fireworks |
| ... | ... | ... | ... |
| 18 | Mistral-7B-v0.2 | 17.8 | Featherless |

A key observation: provider infrastructure affects speed more than model size. Groq's inference hardware runs Llama-4-Scout (17B-active MoE) at 240 tok/s, while Mistral-7B-v0.2 on Featherless achieves only 18 tok/s: a 13× gap driven by serving infrastructure rather than architecture. This real-world speed variance is captured in PIR and propagated into WCS.


Season System: Preventing Benchmark Contamination

| Season 1 (Current) | |
|---|---|
| Total questions | 125 |
| Anchor questions | 30 (fixed across seasons for IRT calibration) |
| Rotating questions | 95 (70%+ replaced each season) |
| Union Eval | 19 (undisclosed, rotated seasonally) |
| Period | 2026 Q1 |
| Next season | 2026 Q3 (planned) |

The 30 anchor questions remain identical across seasons for Item Response Theory (IRT) calibration, enabling cross-season comparability. The 95 rotating questions are replaced to prevent benchmark contamination, a growing concern as models increasingly train on public evaluation data. Previous seasons' rotating questions are released publicly after each season concludes.
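
The official calibration procedure isn't spelled out here; as an illustration only, a generic mean-sigma linking over the 30 shared anchor questions would map a new season's scores back onto the Season 1 scale:

```python
import statistics as st

def mean_sigma_link(anchor_old: list[float], anchor_new: list[float]):
    """Illustrative cross-season equating (not the official method):
    fit a linear map so the new season's anchor-question scores match
    the old season's mean and spread, then apply it to any score."""
    a = st.stdev(anchor_old) / st.stdev(anchor_new)
    b = st.mean(anchor_old) - a * st.mean(anchor_new)
    return lambda score: a * score + b

# Hypothetical anchor scores: Season 2 anchors come out lower overall,
# so raw Season 2 scores are adjusted upward onto the Season 1 scale.
link = mean_sigma_link([62.0, 71.0, 80.0], [58.0, 66.0, 74.0])
print(round(link(70.0), 1))  # 75.5
```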


Evaluated Models (18)

| # | HuggingFace Repo ID | Display Name | Maker | Arch |
|---|---|---|---|---|
| 1 | meta-llama/Llama-3.2-1B-Instruct | Llama-3.2-1B | Meta | Dense |
| 2 | Qwen/Qwen3-1.7B | Qwen3-1.7B | Alibaba | Dense |
| 3 | openai/gpt-oss-20b | GPT-OSS-20B | OpenAI | MoE 🧠 |
| 4 | CohereLabs/tiny-aya-fire | Tiny-Aya-Fire | Cohere | Dense |
| 5 | Qwen/Qwen3-4B-Instruct-2507 | Qwen3-4B | Alibaba | Dense |
| 6 | google/gemma-3n-E4B-it | Gemma-3n-E4B | Google | PLE |
| 7 | zai-org/GLM-4.7-Flash | GLM-4.7-Flash | Zhipu AI | MoE 🧠 |
| 8 | mistralai/Mistral-7B-Instruct-v0.2 | Mistral-7B-v0.2 | Mistral | Dense |
| 9 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-7B | DeepSeek | Dense 🧠 |
| 10 | Qwen/Qwen3-8B | Qwen3-8B | Alibaba | Dense |
| 11 | meta-llama/Llama-3.1-8B-Instruct | Llama-3.1-8B | Meta | Dense |
| 12 | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | Nemotron-Nano-8B | NVIDIA | Dense 🧠 |
| 13 | Qwen/Qwen3.5-9B | Qwen3.5-9B | Alibaba | Dense |
| 14 | allenai/Olmo-3-7B-Instruct | OLMo-3-7B | AllenAI | Dense |
| 15 | google/gemma-3-12b-it | Gemma-3-12B | Google | Dense |
| 16 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-14B | DeepSeek | Dense 🧠 |
| 17 | Qwen/Qwen3.5-35B-A3B | Qwen3.5-35B-A3B | Alibaba | MoE |
| 18 | meta-llama/Llama-4-Scout-17B-16E-Instruct | Llama-4-Scout | Meta | MoE |

12 makers represented: Meta, Alibaba, Google, OpenAI, DeepSeek, Mistral, NVIDIA, Cohere, Zhipu AI, AllenAI, and others, with Nanbeige included in the dataset for a future season.


Recommendations by Use Case

| Use Case | Recommended Model | WCS | Rationale |
|---|---|---|---|
| 🏆 Overall best | GPT-OSS-20B | 82.6 | Highest WCS: top tier in both quality and efficiency |
| ⭐ Highest quality | Gemma-3n-E4B | 81.8 | SHIFT 77.3 (#1 quality) in just 2GB of RAM |
| ⚡ Fastest inference | Llama-4-Scout | 79.3 | 240.5 tok/s, 13× faster than comparable dense models |
| 📱 Smartphone deployment | Gemma-3n-E4B | 81.8 | Runs on 4GB phones with quality exceeding most laptop models |
| 🧠 Most trustworthy | Qwen3-8B | 72.8 | Honesty score H = 87.9, highest of any model tested |
| 💰 Best value | GPT-OSS-20B | 82.6 | Champions-quality AI in 1.5GB; fits on a Raspberry Pi |
| 🖥️ Closest to SOTA | Gemma-3-12B | 55.0 | Union 57.1 = 82% of Claude Sonnet on identical questions |
| 🌍 Best multilingual | Gemma-3n-E4B | 81.8 | I4 = 65.2, highest score across 7 languages |

Conclusion

The Smol AI WorldCup Season 1 evaluation of 18 models across 12 makers yields a clear conclusion:

Architecture and training quality, not parameter count, determine practical deployment value.

A 4B model outperforms 8B at 36% of the RAM. A 1.5GB MoE matches Champions League dense models. A 2025-generation 1.7B outscores three 2024-generation 7–14B models. These findings challenge the scaling assumption that "bigger is always better" with measured, reproducible evidence.

The SHIFT framework and WCS metric provide the first integrated scoring system that captures all five axes that matter for edge AI deployment. The benchmark's season system, with 70%+ question rotation and IRT-calibrated anchors, is designed for long-term durability against contamination.

We look forward to Season 2 (2026 Q3), which will expand the model roster, rotate the question set, and refine the evaluation methodology based on community feedback.

The dataset is publicly available under Apache 2.0. We welcome submissions of new models for evaluation.


Developed by Ginigen.ai in collaboration with the FINAL Bench research team.

Small but Mighty AI.
