Smol AI WorldCup: A 5-Axis Benchmark That Reveals What Small Language Models Can Really Do
A 4B model outperforms 8B. A 1.5GB MoE achieves Champions-level quality. A 1.7B beats three 7–14B models. These aren't hypothetical claims: they're measured results from 18 models, 12 makers, 125 questions, and 7 languages.
We introduce Smol AI WorldCup, the first benchmark designed specifically for the deployment realities of small language models. Instead of asking only "how smart is it?", we ask five questions simultaneously: How smart? How honest? How fast? How small? How efficient?
The result is SHIFT โ a 5-axis evaluation framework โ and WCS (WorldCup Score) โ a composite metric that rewards models achieving both high quality and high efficiency.
| Live Leaderboard | huggingface.co/spaces/ginigen-ai/smol-worldcup |
| Dataset (125Q, Apache 2.0) | huggingface.co/datasets/ginigen-ai/smol-worldcup |
| ALL Bench Leaderboard | huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard |
The Problem: Existing Benchmarks Weren't Built for Edge AI
MMLU, GPQA, and HumanEval apply the same test to a 0.5B model and a 500B model. This approach has three fundamental blind spots when it comes to small model deployment:
1. They measure only intelligence. An engineer deploying AI on a smartphone needs to know not just "MMLU 70" but whether the model fits in 2GB, how often it fabricates information, and how many tokens it generates per second.
2. They miss small-model-specific failure modes. Our measurements show that a 1.3B model fabricates confident fake content about nonexistent people, papers, and products 80% of the time when prompted. This risk is invisible to traditional benchmarks.
3. They don't reward efficiency. A 14B model scoring higher than a 4B model is expected. But the fact that "4B achieves 95% of 14B's quality at 36% of the RAM" is lost in a single-column leaderboard. In the edge AI era, performance per unit of resource is what matters.
SHIFT: A 5-Axis Framework for What Actually Matters
SHIFT decomposes the deployment value of a small language model into five measurable axes:
| Axis | Meaning | How Measured | Questions |
|---|---|---|---|
| S – Size | Model footprint | Parameter count, active params (MoE) | Metadata |
| H – Honesty | Hallucination resistance, calibration, refusal balance | 40 questions, fully auto-graded | 40 |
| I – Intelligence | Reasoning, math, coding, 7 languages, metacognition | 85 questions, auto + LLM judge | 85 |
| F – Fast | Inference throughput | tok/s measured via HF Inference API | 15 samples |
| T – Thrift | Resource consumption | Peak VRAM/RAM at Q4 quantization | Metadata |
Honesty (H) – 40 Questions
Small models fail differently than large models. The most dangerous failure is confident fabrication. We test four dimensions:
| Category | Questions | Method | What It Catches |
|---|---|---|---|
| H1 – Hallucination Trap | 10 | json_field_check | Presents fake entities; model must refuse to fabricate |
| H2 – Confidence Calibration | 10 | calibration_check | Stated confidence must match actual accuracy |
| H3 – Refusal Balance | 10 | refusal_check | Penalizes both over-refusal and under-refusal |
| H4 – Self-Correction | 10 | self_correction_check | Model must catch and fix its own reasoning errors |
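To make the H2 idea concrete, here is a minimal Python sketch of what a confidence-calibration check could look like. The function name, the `(confidence, correctness)` input shape, and the 0–100 scoring rule are illustrative assumptions, not the benchmark's actual grader.

```python
def calibration_score(items):
    """H2-style calibration sketch: score is high when the model's stated
    confidence matches its observed accuracy. (Illustrative scoring rule,
    not the benchmark's actual grader.)

    items: list of (stated_confidence_pct, was_correct) pairs.
    """
    if not items:
        return 0.0
    mean_conf = sum(conf for conf, _ in items) / len(items)
    accuracy = 100.0 * sum(1 for _, ok in items if ok) / len(items)
    # Perfect calibration: stated confidence equals measured accuracy.
    return max(0.0, 100.0 - abs(mean_conf - accuracy))
```

A model that says "80% confident" on every item but is right only 60% of the time would score 80 under this rule, while a perfectly calibrated model scores 100.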
Intelligence (I) – 85 Questions
| Category | Questions | Method | Coverage |
|---|---|---|---|
| I1 – Reasoning | 15 | answer_match | Syllogisms, puzzles, pattern recognition |
| I2 – Math | 10 | numeric_match | Arithmetic through compound interest |
| I3 – Coding | 10 | code_execution | Python functions with executable test cases |
| I4 – Multilingual | 35 | llm_judge | 7 languages including KO, AR, PT, TR, BN, TH; 2.7B+ speakers |
| I5 – Knowledge Synthesis | 10 | llm_judge | Constrained explanations, critical thinking |
| I6 – Metacognition | 5 | llm_judge | Self-awareness, knowledge boundaries |
All 125 questions require mandatory JSON output with verifiable fields. Grading uses structured field validation โ not keyword matching. 75 of 125 questions are fully automatic with zero human intervention.
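A structured-field validator of this kind can be sketched in a few lines of Python. Everything here is a hypothetical illustration of the `json_field_check` idea (the signature, the brace-matching regex, and the field names in the usage example are assumptions, not the benchmark's implementation):

```python
import json
import re

def json_field_check(raw_output, required_fields, expected=None):
    """Sketch of structured-field grading: extract the first JSON object
    from a model's raw output, require the mandatory fields, and
    optionally compare specific fields against expected values."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if not match:
        return False  # no JSON object at all -> automatic fail
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return False  # malformed JSON -> automatic fail
    if not all(field in obj for field in required_fields):
        return False  # missing a mandatory field
    if expected:
        return all(obj.get(k) == v for k, v in expected.items())
    return True
```

For a hallucination-trap item, a grader along these lines could require an `exists` field and pass only models that set it to `false` for a fabricated entity, which is why keyword matching is unnecessary.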
WCS: The Official Ranking Metric
Why Not Just Use SHIFT or PIR Alone?
A small-model benchmark must measure two things simultaneously:
- Quality: "How well does it perform?" → SHIFT score
- Efficiency: "How much resource does it consume?" → PIR score
Using SHIFT alone, a 14B model naturally ranks above a 1.7B model, which tells us nothing new. Using PIR alone, a 1.3B model with mediocre quality becomes #1, which is misleading. Both must be high for a ranking to be meaningful.
The Formula
WCS = √(SHIFT × PIR_norm)
Where:
SHIFT = H × 0.4 + I × 0.6 (quality: honesty 40%, intelligence 60%)
PIR = (I × H × F) ÷ (S × T) (efficiency: output per unit resource)
PIR_norm = log₁₀(PIR) / log₁₀(max) × 100 (normalized to 0–100)
Why geometric mean? Arithmetic mean (A+B)/2 allows one strong axis to compensate for a weak one. Geometric mean √(A×B) requires both to be strong: a model that's smart but massive, or tiny but poor, ranks low under WCS.
Why log-normalize PIR? Raw PIR ranges from 18 to 6,952, a 386× spread. Log₁₀ compression maps this to 0–100, making it commensurable with SHIFT on the same scale.
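Putting the three formulas together gives a short Python sketch. The units of S and T (and therefore the raw PIR magnitudes) are assumptions here, since the post does not spell them out; 6,952 is used as the normalization ceiling because it is Season 1's maximum observed PIR.

```python
import math

def wcs(h, i, f, s, t, pir_max=6952.0):
    """WorldCup Score sketch: geometric mean of SHIFT quality and
    log-normalized PIR efficiency. h and i are 0-100 axis scores,
    f is tok/s, s and t are the size/thrift terms (units assumed)."""
    shift = 0.4 * h + 0.6 * i              # honesty 40%, intelligence 60%
    pir = (i * h * f) / (s * t)            # output per unit resource
    pir_norm = 100.0 * math.log10(pir) / math.log10(pir_max)
    return math.sqrt(shift * pir_norm)     # both factors must be high
```

Because of the geometric mean, shrinking the resource terms at equal quality raises WCS, and no amount of raw quality can rescue a model whose efficiency term collapses.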
Football League Tiers
Models are classified by actual runtime RAM (Q4 quantized), not raw parameter count. This correctly handles MoE models where total params ≠ active params.
| League | RAM | Target Hardware | Season 1 Champion |
|---|---|---|---|
| League One | < 2GB | Raspberry Pi, IoT | GPT-OSS-20B (WCS 82.6) |
| La Liga | 2–4GB | Smartphone | Gemma-3n-E4B (WCS 81.8) |
| Premier League | 4–8GB | Laptop (8GB) | Qwen3-8B (WCS 72.8) |
| Champions League | 8–16GB | Desktop PC | Llama-4-Scout (WCS 79.3) |
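The tier boundaries above translate directly into a classifier. The function name and the out-of-range behavior are illustrative; the boundaries themselves come from the tier table.

```python
def assign_league(runtime_ram_gb):
    """Map a model's Q4 runtime RAM (GB) to its league tier.
    Boundaries taken from the Season 1 tier table; the name and
    out-of-range behavior are illustrative."""
    if runtime_ram_gb < 2:
        return "League One"        # Raspberry Pi, IoT
    if runtime_ram_gb < 4:
        return "La Liga"           # smartphone
    if runtime_ram_gb < 8:
        return "Premier League"    # 8GB laptop
    if runtime_ram_gb < 16:
        return "Champions League"  # desktop PC
    return None  # outside the small-model scope
```

This is why GPT-OSS-20B, despite its 21B total parameters, lands in League One: the input is its 1.5GB runtime footprint, not its parameter count.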
Season 1 Results: 18 Models, 12 Makers
All scores measured via the HuggingFace Inference API across 8 providers. Speed measured with 5 prompts × 3 rounds per model. Evaluated March 2026.
| Rank | Model | Maker | WCS | SHIFT | PIR | tok/s | League |
|---|---|---|---|---|---|---|---|
| 1 | GPT-OSS-20B | OpenAI | 82.6 | 76.9 | 2,586 | 71.9 | League One |
| 2 | Gemma-3n-E4B | Google | 81.8 | 77.3 | 2,136 | 43.8 | La Liga |
| 3 | Llama-4-Scout | Meta | 79.3 | 74.2 | 1,804 | 240.5 | Champions League |
| 4 | Qwen3-4B | Alibaba | 76.6 | 76.8 | 858 | 50.0 | La Liga |
| 5 | Qwen3-1.7B | Alibaba | 76.1 | 66.8 | 2,148 | 30.1 | League One |
| 6 | GLM-4.7-Flash 🧠 | Zhipu AI | 73.2 | 74.8 | 566 | 50.8 | La Liga |
| 7 | Qwen3.5-35B-A3B | Alibaba | 72.9 | 75.3 | 517 | 108.7 | Champions League |
| 8 | Qwen3-8B | Alibaba | 72.8 | 76.9 | 445 | 186.8 | Premier League |
| 9 | Llama-3.2-1B | Meta | 70.5 | 49.7 | 6,952 | 113.2 | League One |
| 10 | Tiny-Aya-Fire | Cohere | 69.7 | 58.9 | 1,488 | 111.6 | La Liga |
| 11 | Qwen3.5-9B | Alibaba | 67.3 | 71.1 | 280 | 130.6 | Premier League |
| 12 | OLMo-3-7B | AllenAI | 65.5 | 70.2 | 224 | 50.0 | Premier League |
| 13 | DeepSeek-R1-7B 🧠 | DeepSeek | 65.4 | 68.2 | 257 | 69.2 | Premier League |
| 14 | Llama-3.1-8B | Meta | 62.4 | 61.0 | 282 | 187.7 | Premier League |
| 15 | Nemotron-Nano-8B 🧠 | NVIDIA | 58.4 | 65.9 | 98 | 29.8 | Premier League |
| 16 | Gemma-3-12B | Google | 55.0 | 75.7 | 34 | 18.7 | Champions League |
| 17 | Mistral-7B-v0.2 | Mistral | 53.0 | 60.6 | 60 | 17.8 | Premier League |
| 18 | DeepSeek-R1-14B 🧠 | DeepSeek | 44.2 | 59.8 | 18 | 21.4 | Champions League |
🧠 = Thinking model (uses internal `<think>` reasoning tokens)
Key Findings
1. 4B Outperforms 8B
Gemma-3n-E4B (4B parameters, 2.0GB RAM) achieves SHIFT 77.3, the highest quality score among all 18 models. Qwen3-8B (8B, 5.5GB) scores 76.9. The gap is 0.4 points. The RAM difference is 2.75×.
This is not an anomaly. Qwen3-4B (4B, 2.8GB) also scores 76.8, essentially matching the 8B model at half the resource cost. The implication is concrete: smartphone-class hardware (2–3GB) can now run laptop-class AI (5.5GB) with negligible quality loss.
2. The MoE Efficiency Revolution
GPT-OSS-20B is a Mixture-of-Experts model with 21B total parameters but only 3.6B active at inference. It fits in 1.5GB of RAM, classifying it as League One (Raspberry Pi tier), yet achieves SHIFT 76.9, matching Champions League dense models that require 8–9GB.
At comparable SHIFT scores (~76), the MoE model uses 5.7× less RAM than the equivalent dense model (Gemma-3-12B at 8.5GB). This validates MoE as the most promising architecture for edge deployment at scale.
3. Thinking Models: A Double-Edged Sword
Models that use internal reasoning tokens (`<think>...</think>`) face two penalties in SHIFT evaluation:
Quality penalty: The `<think>` tags interfere with structured JSON output, which is mandatory for all 125 questions. At comparable sizes:
- Qwen3-8B (non-thinking) → SHIFT 76.9
- DeepSeek-R1-7B (thinking) → SHIFT 68.2 (−8.7 points)
Speed penalty: Internal reasoning generates 2–6× more tokens than visible output:
- Qwen3-8B → 186.8 tok/s
- DeepSeek-R1-7B → 69.2 tok/s (2.7× slower)
- Nemotron-Nano-8B → 29.8 tok/s (6.3× slower)
This does not mean thinking models are inferior: DeepSeek-R1-14B's reasoning score (I1 = 96.7) is the highest of any model tested. But for structured output tasks and real-time edge deployment, non-thinking models deliver substantially better practical value.
4. Hallucination Trap: The Most Discriminating Metric
H1 (Hallucination Trap) produces the widest score spread of any single metric: 80 points (from 20 to 100).
The test presents nonexistent people, academic papers, laws, and products, then asks the model to describe them. Models that correctly refuse score high; models that confidently fabricate content score low.
H1 = 100: Qwen3-4B, Qwen3-8B, GPT-OSS-20B, GLM-4.7-Flash, Qwen3.5-35B-A3B
H1 = 90: Gemma-3n-E4B, Llama-4-Scout, Qwen3.5-9B
H1 = 80: DeepSeek-R1-7B, Gemma-3-12B
H1 = 70: Llama-3.1-8B, Nemotron-Nano-8B
H1 = 60: Qwen3-1.7B, DeepSeek-R1-14B, OLMo-3-7B
H1 = 40: Mistral-7B-v0.2
H1 = 30: Tiny-Aya-Fire
H1 = 20: Llama-3.2-1B
Two notable patterns emerge. First, the Qwen3 family achieves strong hallucination resistance across all sizes (60–100 from 1.7B to 35B), suggesting that training pipeline design is more determinative than scale. Second, a 1.3B model fabricates fake content 80% of the time, a critical safety consideration for IoT and embedded deployment.
5. The 1.7B Rebellion: Architecture Generation Beats Scale
Qwen3-1.7B (1.7B parameters, 1.2GB RAM, released 2025) outscores three significantly larger models:
- Mistral-7B-v0.2 (7.2B, 5.0GB, 2024) → SHIFT 60.6 (−6.2 points)
- Llama-3.1-8B (8.0B, 5.5GB, 2024) → SHIFT 61.0 (−5.8 points)
- DeepSeek-R1-14B (14.8B, 9.5GB, 2025) → SHIFT 59.8 (−7.0 points)
A 1.7B model from the latest generation outperforms models 4–9× its size from the previous generation. In the small-model domain, architecture vintage matters more than parameter count.
Union Eval: Direct Comparison with Frontier SOTA
Beyond the 125 SHIFT questions, each model takes a separate Union Eval consisting of 19 cross-benchmark questions. The same questions are administered to frontier-class models, enabling direct comparison:
| Rank | Frontier Model | Union Score |
|---|---|---|
| 1 | Claude Sonnet 4.6 | 69.9 |
| 2 | Claude Opus 4.6 | 69.3 |
| 3 | GPT-5.4 | 62.4 |
| 4 | DeepSeek V3.2 | 60.3 |
| 5 | Qwen3.5-397B | 57.1 |
Best small model: Gemma-3-12B = 57.1 (82% of Claude Sonnet)
A 12B open-source model matches the score of a 397B model on identical questions. This places the current frontier of small-model capability at approximately 80% of the best proprietary systems โ a gap that is closing rapidly.
Research Collaboration & Benchmark Ecosystem
Smol AI WorldCup was developed in collaboration with the FINAL Bench research team. The cross-benchmark question design for Union Eval and the SOTA comparison framework draw on FINAL Bench's evaluation methodology.
The benchmark also integrates with the ALL Bench Leaderboard, creating a complementary evaluation ecosystem:
- Smol AI WorldCup answers: "Where does this 4B model rank among small models?"
- ALL Bench Leaderboard answers: "Where does that 4B model rank against GPT-5 and Claude?"
This cross-benchmark interoperability provides model developers and deployers with a unified view across the full spectrum, from edge-class small models to frontier-class large models, within a single evaluation ecosystem. Rather than operating in isolation, both benchmarks share evaluation principles and directional data to support the broader open-source AI community.
Speed Measurement Methodology
All 18 models were measured under identical conditions via the HuggingFace Inference API:
- Warmup call – first inference excluded (cold-start bias removal)
- 5 diverse prompts – explanation, Python coding, Korean translation, JSON generation, arithmetic
- 3 rounds – 15 total samples per model
- Metric: tok/s = completion_tokens / elapsed_seconds
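The aggregation step above can be sketched as a small helper. The name, signature, and sample format are illustrative assumptions, not the benchmark's harness code.

```python
def tokens_per_second(samples, warmup=1):
    """Aggregate throughput from (completion_tokens, elapsed_seconds)
    measurements, dropping the first `warmup` calls to remove
    cold-start bias. Illustrative helper, not the benchmark's code."""
    usable = [(tok, sec) for tok, sec in samples[warmup:] if sec > 0]
    if not usable:
        raise ValueError("no usable samples after warmup")
    rates = [tok / sec for tok, sec in usable]
    return sum(rates) / len(rates)  # mean per-call tok/s
```

Dropping the warmup call matters: a slow cold-start first inference would otherwise drag the mean down for every provider.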
| Rank | Model | tok/s | Provider |
|---|---|---|---|
| 1 | Llama-4-Scout | 240.5 | Groq |
| 2 | Llama-3.1-8B | 187.7 | Cerebras |
| 3 | Qwen3-8B | 186.8 | Fireworks |
| ... | ... | ... | ... |
| 17 | Mistral-7B-v0.2 | 17.8 | Featherless |
A key observation: provider infrastructure affects speed more than model size. The Groq inference chip runs Llama-4-Scout (17B MoE) at 240 tok/s, while the same-tier dense model on Featherless achieves only 18 tok/s, a 13× gap driven entirely by hardware, not architecture. This real-world speed variance is captured in PIR and propagated into WCS.
Season System: Preventing Benchmark Contamination
| Season 1 (Current) | |
|---|---|
| Total questions | 125 |
| Anchor questions | 30 (fixed across seasons for IRT calibration) |
| Rotating questions | 95 (70%+ replaced each season) |
| Union Eval | 19 (undisclosed, rotated seasonally) |
| Period | 2026 Q1 |
| Next season | 2026 Q3 (planned) |
The 30 anchor questions remain identical across seasons for Item Response Theory (IRT) calibration, enabling cross-season comparability. The 95 rotating questions are replaced to prevent benchmark contamination โ a growing concern as models increasingly train on public evaluation data. Previous seasons' rotating questions are released publicly after each season concludes.
Evaluated Models (18)
| # | HuggingFace Repo ID | Display Name | Maker | Arch |
|---|---|---|---|---|
| 1 | meta-llama/Llama-3.2-1B-Instruct | Llama-3.2-1B | Meta | Dense |
| 2 | Qwen/Qwen3-1.7B | Qwen3-1.7B | Alibaba | Dense |
| 3 | openai/gpt-oss-20b | GPT-OSS-20B | OpenAI | MoE 🧠 |
| 4 | CohereLabs/tiny-aya-fire | Tiny-Aya-Fire | Cohere | Dense |
| 5 | Qwen/Qwen3-4B-Instruct-2507 | Qwen3-4B | Alibaba | Dense |
| 6 | google/gemma-3n-E4B-it | Gemma-3n-E4B | Google | PLE |
| 7 | zai-org/GLM-4.7-Flash | GLM-4.7-Flash | Zhipu AI | MoE 🧠 |
| 8 | mistralai/Mistral-7B-Instruct-v0.2 | Mistral-7B-v0.2 | Mistral | Dense |
| 9 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-7B | DeepSeek | Dense 🧠 |
| 10 | Qwen/Qwen3-8B | Qwen3-8B | Alibaba | Dense |
| 11 | meta-llama/Llama-3.1-8B-Instruct | Llama-3.1-8B | Meta | Dense |
| 12 | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | Nemotron-Nano-8B | NVIDIA | Dense 🧠 |
| 13 | Qwen/Qwen3.5-9B | Qwen3.5-9B | Alibaba | Dense |
| 14 | allenai/Olmo-3-7B-Instruct | OLMo-3-7B | AllenAI | Dense |
| 15 | google/gemma-3-12b-it | Gemma-3-12B | Google | Dense |
| 16 | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | DeepSeek-R1-14B | DeepSeek | Dense 🧠 |
| 17 | Qwen/Qwen3.5-35B-A3B | Qwen3.5-35B-A3B | Alibaba | MoE |
| 18 | meta-llama/Llama-4-Scout-17B-16E-Instruct | Llama-4-Scout | Meta | MoE |
12 makers represented: Meta, Alibaba, Google, OpenAI, DeepSeek, Mistral, NVIDIA, Cohere, Zhipu AI, AllenAI, Nanbeige (in dataset, future season), and others.
Recommendations by Use Case
| Use Case | Recommended Model | WCS | Rationale |
|---|---|---|---|
| Overall best | GPT-OSS-20B | 82.6 | Highest WCS: top tier in both quality and efficiency |
| Highest quality | Gemma-3n-E4B | 81.8 | SHIFT 77.3 (#1 quality) in just 2GB of RAM |
| Fastest inference | Llama-4-Scout | 79.3 | 240.5 tok/s, 13× faster than comparable dense models |
| Smartphone deployment | Gemma-3n-E4B | 81.8 | Runs on 4GB phones with quality exceeding most laptop models |
| Most trustworthy | Qwen3-8B | 72.8 | Honesty score H = 87.9, highest of any model tested |
| Best value | GPT-OSS-20B | 82.6 | Champions-quality AI in 1.5GB, fits on a Raspberry Pi |
| Closest to SOTA | Gemma-3-12B | 55.0 | Union 57.1 = 82% of Claude Sonnet on identical questions |
| Best multilingual | Gemma-3n-E4B | 81.8 | I4 = 65.2, highest score across 7 languages |
Conclusion
The Smol AI WorldCup Season 1 evaluation of 18 models across 12 makers yields a clear conclusion:
Architecture and training quality โ not parameter count โ determine practical deployment value.
A 4B model outperforms 8B at 36% of the RAM. A 1.5GB MoE matches Champions League dense models. A 2025-generation 1.7B outscores three 2024-generation 7–14B models. These findings challenge the scaling assumption that "bigger is always better" with measured, reproducible evidence.
The SHIFT framework and WCS metric provide the first integrated scoring system that captures all five axes that matter for edge AI deployment. The benchmark's season system, with 70%+ question rotation and IRT-calibrated anchors, is designed for long-term durability against contamination.
We look forward to Season 2 (2026 Q3), which will expand the model roster, rotate the question set, and refine the evaluation methodology based on community feedback.
The dataset is publicly available under Apache 2.0. We welcome submissions of new models for evaluation.
Developed by Ginigen.ai in collaboration with the FINAL Bench research team.
Small but Mighty AI.