Final 7-Model Comparison: Agent Task Performance on Mac Mini M4 16GB
Date: 2026-04-07 | Hardware: Mac Mini M4 16GB (Dyson)
Results At a Glance
Model                 Size    Speed     Agent    Fit     Verdict
                      (GB)    (tok/s)   Score    16GB?
──────────────────────────────────────────────────────────────────────────
Qwopus3.5-27B Q3      11.0    OOM       N/A      ❌      Too big, needs 24GB+
Qwen3.5-9B Q4         5.6     10        4.5/6    ✅      Most reliable agent
Gemma4 E4B 4bit       5.0     35        4.5/6    ✅      Fastest viable agent
Bonsai-8B 1bit        1.15    49        1.0/6    ✅      Single-turn only
LFM2.5-Nova 1.2B      0.70    118       0.0/6    ✅      Context too small (4K)
FunctionGemma 270M    0.28    197       0.0/6    ✅      Repeats infinitely
Qwopus3.5-27B Q3      11.0    OOM       N/A      ❌      OOM at any context
Detailed Comparison
┌─────────────────────┬────────┬────────┬───────┬───────┬────────┬───────┬──────┐
│ Model               │Params  │Disk    │Memory │Gen    │Agent   │Multi  │Proxy │
│                     │(active)│(GB)    │(GB)   │tok/s  │Score   │-step? │Need? │
├─────────────────────┼────────┼────────┼───────┼───────┼────────┼───────┼──────┤
│ Qwopus3.5-27B v3 Q3 │ 27B    │ 11.0   │ 14+   │ OOM   │ N/A    │ ?     │ No   │
│ Qwen3.5-9B Q4_K_XL  │ 9B     │ 5.6    │ 6.5   │ 10    │ 4.5/6  │ ✅    │ No   │
│ Gemma4 E4B 4bit     │ 4B MoE │ 5.0    │ 5.4   │ 35    │ 4.5/6  │ ✅    │ Yes  │
│ Bonsai-8B Q1_0      │ 8B     │ 1.15   │ 1.5   │ 49    │ 1.0/6  │ ❌    │ No   │
│ LFM2.5-Nova 1.2B Q4 │ 1.2B   │ 0.70   │ 0.8   │ 118   │ 0.0/6  │ ❌    │ No   │
│ FunctionGemma 270M  │ 270M   │ 0.28   │ 0.3   │ 197   │ 0.0/6  │ ❌    │ No   │
└─────────────────────┴────────┴────────┴───────┴───────┴────────┴───────┴──────┘
Speed vs Capability Cliff
Agent Score (6 = perfect)
  5 │        ✅ Qwen (10 tok/s)    ✅ Gemma4 (35 tok/s)
    │
  4 │
    │
  3 │
    │
  2 │
    │
  1 │                                    ✅ Bonsai (49 tok/s)
    │
  0 │ ❌ Qwopus                                     ✅ LFM2.5     ✅ FuncGemma
    │   (OOM)                                       (118 tok/s)  (197 tok/s)
    └──┬──────────┬──────────┬──────────┬──────────┬──────────┬──
       0         10         35         50        118        197
                       Generation Speed (tok/s)
✅ = tested   ❌ = OOM
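THE CLIFF: There is a hard capability cliff at roughly 4B active params.
In these tests, no model below 4B active params completed a multi-step
agent task, regardless of how fast it generates. Parameter count alone is
not sufficient either: Bonsai-8B (1-bit) has 8B params but scored 1.0/6.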
Task Breakdown (6 tests)
┌────┬─────────────────┬────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ #  │ Task            │Diff│ Qwen 9B  │Gemma4 E4B│Bonsai 8B │ LFM 1.2B │FuncG 270M│
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1  │ Wikipedia info  │ E  │ ⚠️ 1T    │ ✅ 4T    │ ⚠️ 1T    │ ❌ OOC   │ ❌ Loop  │
│ 2  │ DDG search      │ M  │ ⚠️ 6T    │ ✅ 6T    │ ❌ 16T   │ ❌ OOC   │ ❌ Loop  │
│ 3  │ HN top story    │ E  │ ✅ 2T    │ ✅ 2T    │ ⚠️ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 4  │ Cat vision (FP) │ M  │ ✅ 2T    │ ✅ 3T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 5  │ Form filling    │ M  │ ✅ 6T    │ ❌ 1T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 6  │ reCAPTCHA       │ H  │ ⚠️ 6T    │ ⚠️ 13T   │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│    │ Score           │    │ 4.5/6    │ 4.5/6    │ 1.0/6    │ 0.0/6    │ 0.0/6    │
│    │ Total time      │    │ 834s     │ 340s     │ 103s     │ ~5s      │ ~5s      │
│    │ Total tools     │    │ 23       │ 29       │ 21       │ 0        │ 0        │
└────┴─────────────────┴────┴──────────┴──────────┴──────────┴──────────┴──────────┘
T = tool calls | E = Easy | M = Medium | H = Hard
OOC = Out of Context | Loop = infinite repetition
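The Score and Total tools rows can be cross-checked with a short tally. The sketch below assumes a weighting of pass = 1, partial = 0.5, fail = 0; the report does not state its scoring rule, so that weighting is an inference from the totals, and the outcome labels are a reading of the per-task status markers for the three models that made tool calls.

```python
# Outcome weights are an ASSUMPTION; the report does not state its scoring rule.
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

# (outcome, tool_calls) for tasks 1-6, transcribed from the task table.
RESULTS = {
    "Qwen 9B": [("partial", 1), ("partial", 6), ("pass", 2),
                ("pass", 2), ("pass", 6), ("partial", 6)],
    "Gemma4 E4B": [("pass", 4), ("pass", 6), ("pass", 2),
                   ("pass", 3), ("fail", 1), ("partial", 13)],
    "Bonsai 8B": [("partial", 1), ("fail", 16), ("partial", 1),
                  ("fail", 1), ("fail", 1), ("fail", 1)],
}


def summarize(rows):
    """Return (score, total_tool_calls) for one model's six task results."""
    score = sum(WEIGHTS[outcome] for outcome, _ in rows)
    tools = sum(calls for _, calls in rows)
    return score, tools


SUMMARY = {model: summarize(rows) for model, rows in RESULTS.items()}

if __name__ == "__main__":
    for model, (score, tools) in SUMMARY.items():
        print(f"{model}: {score:.1f}/6, {tools} tool calls")
```

Under that assumed weighting the tally reproduces the table's totals for all three models, which suggests partial results were counted as half a point.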
Why Small Models Fail at Agent Tasks
┌───────────────────────────────────────────────────────────────────┐
│                    THE AGENT CAPABILITY LADDER                    │
│                                                                   │
│  Level 1: FORMAT A TOOL CALL                                      │
│  ├─ Can generate {"name": "func", "arguments": {...}}             │
│  ├─ ALL models pass this (even 270M at 197 tok/s)                 │
│  └─ This is what BFCL benchmarks measure                          │
│                                                                   │
│  Level 2: UNDERSTAND TOOL RESULTS                                 │
│  ├─ Read tool output and decide next action                       │
│  ├─ Requires: context understanding, error handling               │
│  └─ Bonsai-8B fails here (makes 1 call then stops)                │
│                                                                   │
│  Level 3: CHAIN MULTIPLE TOOLS                                    │
│  ├─ Navigate → type → click → extract → report                    │
│  ├─ Requires: planning, sequential reasoning, 5K+ context         │
│  └─ Minimum ~4B active params (Gemma4 E4B)                        │
│                                                                   │
│  Level 4: HANDLE ERRORS AND ADAPT                                 │
│  ├─ Tool fails → try different approach → recover                 │
│  ├─ Requires: robust reasoning, error patterns                    │
│  └─ Qwen 9B reliable, Gemma4 E4B partial                          │
│                                                                   │
│  Level 5: COMPLEX MULTI-FIELD INTERACTION                         │
│  ├─ Fill forms, interact with dynamic UIs                         │
│  ├─ Requires: deep context, field mapping, DOM understanding      │
│  ├─ Only Qwen 9B succeeds (form filling)                          │
│  └─ Likely needs 27B+ for consistent success                      │
│                                                                   │
│  BFCL = Level 1 only. Our tests = Levels 1-5.                     │
│  That's why Bonsai scores 73% on BFCL but 1/6 on our tests.       │
└───────────────────────────────────────────────────────────────────┘
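To make the gap between Level 1 and the higher levels concrete, here is a minimal sketch of what a Level 1 check amounts to: does the output parse as a tool call of the shape shown in the ladder? The function name `is_wellformed_tool_call` and the example tool names are hypothetical; this is neither BFCL's evaluator nor the harness used for these tests.

```python
import json


def is_wellformed_tool_call(text: str) -> bool:
    """Return True if text is a JSON object with a string "name"
    and an object-valued "arguments" field."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("arguments"), dict)
    )


if __name__ == "__main__":
    # Hypothetical tool names, for illustration only.
    good = '{"name": "navigate", "arguments": {"url": "https://example.com"}}'
    print(is_wellformed_tool_call(good))
    print(is_wellformed_tool_call("I will now open the page."))
```

A check like this says nothing about whether the call is the right one, or whether the model can read the result and pick a sensible next step, which is where Levels 2 through 5 begin. A model that emits the same well-formed call forever would pass it every time.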
Memory Map on 16GB
Available: 16 GB

Qwopus 27B Q3:
████████████████████████████████████████████████████████████ 14+ GB ❌ OOM
░░░░ OS (3GB)

Qwen 9B + Falcon:
████████████████████████████ 6.5 GB model
██████ 1.5 GB Falcon
██ 0.8 GB GUA+Browser
░░░░ OS (3GB)
░░░░░ 4.2 GB free ✅

Gemma4 E4B + Falcon:
██████████████████████ 5.4 GB model
██████ 1.5 GB Falcon
██ 0.8 GB GUA+Browser+Proxy
░░░░ OS (3GB)
░░░░░░░ 5.2 GB free ✅ (most headroom)

Bonsai 8B + Qwen 9B (dual!):
██████ 1.5 GB Bonsai
████████████████████████████ 6.5 GB Qwen
██████ 1.5 GB Falcon
██ 0.8 GB GUA+Browser
░░░░ OS (3GB)
░░ 2.7 GB free ✅ (tight but fits!)
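The free-memory figures follow from simple subtraction: total RAM minus the OS reservation minus each listed component. A minimal sketch of that arithmetic, using only the numbers from the memory map (the dictionary layout and the function name `free_gb` are illustrative, and Qwopus is entered at its 14 GB lower bound):

```python
TOTAL_GB = 16.0
OS_GB = 3.0  # OS reservation as listed in the memory map

# Component sizes in GB, transcribed from the memory map.
CONFIGS = {
    "Qwopus 27B Q3": {"model": 14.0},  # listed as "14+", so a lower bound
    "Qwen 9B + Falcon": {"model": 6.5, "Falcon": 1.5, "GUA+Browser": 0.8},
    "Gemma4 E4B + Falcon": {"model": 5.4, "Falcon": 1.5, "GUA+Browser+Proxy": 0.8},
    "Bonsai 8B + Qwen 9B": {"Bonsai": 1.5, "Qwen": 6.5, "Falcon": 1.5, "GUA+Browser": 0.8},
}


def free_gb(components, total=TOTAL_GB, os_gb=OS_GB):
    """Memory left after the OS and all components; negative means it does not fit."""
    return round(total - os_gb - sum(components.values()), 1)


if __name__ == "__main__":
    for name, parts in CONFIGS.items():
        left = free_gb(parts)
        status = "fits" if left >= 0 else "OOM"
        print(f"{name}: {left:+.1f} GB ({status})")
```

Summing the listed components gives 4.2 GB free for Qwen and 2.7 GB for the dual setup, matching the map. For Gemma4 the same subtraction gives 5.3 GB, 0.1 GB above the 5.2 GB shown.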
Final Verdict
┌───────────────────────────────────────────────────────────────────┐
│                                                                   │
│  🏆 BEST OVERALL: Gemma4 E4B (with proxy fixes)                   │
│     • 4.5/6 score, 35 tok/s, 5.4 GB                               │
│     • Best speed-to-capability ratio                              │
│     • Passes DDG search, Wikipedia, HN, and vision tasks         │
│     • Needs proxy (7 fixes) but works reliably                   │
│                                                                   │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                     │
│     • 4.5/6 score, 10 tok/s, 6.5 GB                               │
│     • ONLY model that fills forms successfully                    │
│     • No proxy needed, native tool calling                        │
│     • Slower but handles edge cases better                        │
│                                                                   │
│  🥉 HONORABLE: Bonsai-8B                                          │
│     • 1.0/6 but only 1.15 GB, fits alongside any other model     │
│     • Could serve as fast first-call router                       │
│                                                                   │
│  ❌ DON'T USE FOR AGENTS:                                         │
│     • LFM2.5-Nova (4K context too small)                          │
│     • FunctionGemma (loops infinitely)                            │
│     • Qwopus-27B (doesn't fit 16GB)                               │
│                                                                   │
│  MINIMUM FOR MULTI-STEP AGENTS: ~4B active parameters             │
│  BFCL SCORE ≠ AGENT CAPABILITY                                    │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
7 models tested on the same GUA_Blazor agent loop with Falcon Perception v2. All tests: navigate, search, extract, vision detect, form fill, captcha solve. Mac Mini M4 16GB, April 2026.