SeaWolf-AI
posted an update about 23 hours ago
🏟️ Smol AI WorldCup: A 4B Model Just Beat 8B. Here's the Data

We evaluated 18 small language models from 12 makers on 125 questions across 7 languages. The results challenge the assumption that bigger is always better.

Community Article: https://huggingface.co/blog/FINAL-Bench/smol-worldcup
Live Leaderboard: ginigen-ai/smol-worldcup
Dataset: ginigen-ai/smol-worldcup

What we found:

→ Gemma-3n-E4B (4B, 2GB RAM) outscores Qwen3-8B (8B, 5.5GB). Doubling parameters gained only 0.4 points. RAM cost: 2.75x more.

→ GPT-OSS-20B fits in 1.5GB yet matches Champions-league dense models requiring 8.5GB. MoE architecture is the edge AI game-changer.

→ Thinking models hurt structured output. DeepSeek-R1-7B scores 8.7 points below same-size Qwen3-8B and runs 2.7x slower.

→ A 1.3B model fabricates confident fake content 80% of the time when prompted with nonexistent entities. Qwen3 family hits 100% trap detection across all sizes.

→ Qwen3-1.7B (1.2GB) outscores Mistral-7B, Llama-3.1-8B, and DeepSeek-R1-14B. Latest architecture at 1.7B beats older architecture at 14B.
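
The memory ratios quoted in the findings above follow directly from the stated RAM figures. A minimal sketch of that arithmetic, using only the numbers given in this post (the variable names are illustrative):

```python
# RAM footprints in GB, as quoted in the findings above.
gemma_3n_e4b_gb = 2.0
qwen3_8b_gb = 5.5
gpt_oss_20b_gb = 1.5
dense_champions_gb = 8.5

# Parameter counts in billions, as quoted above.
gemma_params_b = 4
qwen3_params_b = 8

# How much more RAM Qwen3-8B needs than Gemma-3n-E4B.
ram_ratio = qwen3_8b_gb / gemma_3n_e4b_gb

# How many times more parameters Qwen3-8B has.
param_ratio = qwen3_params_b / gemma_params_b

# How much more RAM the dense Champions-league models need than GPT-OSS-20B.
moe_ram_ratio = dense_champions_gb / gpt_oss_20b_gb

print(f"Qwen3-8B vs Gemma-3n-E4B: {param_ratio:.0f}x parameters, {ram_ratio:.2f}x RAM")
print(f"Dense Champions-league vs GPT-OSS-20B: {moe_ram_ratio:.2f}x RAM")
```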

What makes this benchmark different?

Most benchmarks ask "how smart?" We measure five axes simultaneously: Size, Honesty, Intelligence, Fast, Thrift (SHIFT). Our ranking metric WCS = sqrt(SHIFT x PIR_norm) rewards models that are both high-quality AND efficient. Smart but massive? Low rank. Tiny but poor? Also low.
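
To see why the geometric mean penalizes lopsided models, here is a minimal sketch of the WCS formula. It assumes, for illustration only, that SHIFT and PIR_norm are both on a 0-100 scale. The function name `wcs` and the input values are hypothetical, not leaderboard data.

```python
import math


def wcs(shift: float, pir_norm: float) -> float:
    """Compute WCS = sqrt(SHIFT x PIR_norm), the geometric mean of the two scores.

    Assumption: both inputs are non-negative and share a scale (0-100 here).
    """
    if shift < 0 or pir_norm < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(shift * pir_norm)


# Hypothetical inputs, chosen only to show the shape of the metric.
balanced = wcs(64.0, 64.0)    # strong on both axes
lopsided = wcs(100.0, 10.0)   # perfect on one axis, weak on the other

print(f"balanced: {balanced:.1f}")
print(f"lopsided: {lopsided:.1f}")
```

Because the two terms are multiplied, a low score on either one pulls WCS down sharply, which a plain average would not do: the lopsided pair above averages 55 but gets a WCS of about 31.6.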

Top 5 by WCS:
1. GPT-OSS-20B: WCS 82.6, 1.5GB, Raspberry Pi tier
2. Gemma-3n-E4B: WCS 81.8, 2.0GB, Smartphone tier
3. Llama-4-Scout: WCS 79.3, 240 tok/s, Fastest model
4. Qwen3-4B: WCS 76.6, 2.8GB, Smartphone tier
5. Qwen3-1.7B: WCS 76.1, 1.2GB, IoT tier

Built in collaboration with the FINAL Bench research team. Interoperable with ALL Bench Leaderboard for full small-to-large model comparison.

Dataset is open under Apache 2.0 (125 questions, 7 languages). We welcome new model submissions.