# Final 7-Model Comparison: Agent Task Performance on Mac Mini M4 16GB
**Date:** 2026-04-07 | **Hardware:** Mac Mini M4 16GB (Dyson)
---
## Results At a Glance
```
Model               Size    Speed    Agent    Fit     Verdict
                    (GB)    (tok/s)  Score    16GB?
──────────────────────────────────────────────────────────────────────────
Qwopus3.5-27B Q3    11.0    OOM      N/A      ❌      Too big, needs 24GB+
Qwen3.5-9B Q4       5.6     10       4.5/6    ✅      Most reliable agent
Gemma4 E4B 4bit     5.0     35       4.5/6    ✅      Fastest viable agent
Bonsai-8B 1bit      1.15    49       1.0/6    ✅      Single-turn only
LFM2.5-Nova 1.2B    0.70    118      0.0/6    ✅      Context too small (4K)
FunctionGemma 270M  0.28    197      0.0/6    ✅      Repeats infinitely
Qwopus3.5-27B Q3    11.0    OOM      N/A      ❌      OOM at any context
```
## Detailed Comparison
```
┌─────────────────────┬────────┬────────┬───────┬───────┬────────┬───────┬──────┐
│ Model               │Params  │Disk    │Memory │Gen    │Agent   │Multi  │Proxy │
│                     │(active)│(GB)    │(GB)   │tok/s  │Score   │-step? │Need? │
├─────────────────────┼────────┼────────┼───────┼───────┼────────┼───────┼──────┤
│ Qwopus3.5-27B v3 Q3 │ 27B    │ 11.0   │ 14+   │ OOM   │ N/A    │ ?     │ No   │
│ Qwen3.5-9B Q4_K_XL  │ 9B     │ 5.6    │ 6.5   │ 10    │ 4.5/6  │ ✅    │ No   │
│ Gemma4 E4B 4bit     │ 4B MoE │ 5.0    │ 5.4   │ 35    │ 4.5/6  │ ✅    │ Yes  │
│ Bonsai-8B Q1_0      │ 8B     │ 1.15   │ 1.5   │ 49    │ 1.0/6  │ ❌    │ No   │
│ LFM2.5-Nova 1.2B Q4 │ 1.2B   │ 0.70   │ 0.8   │ 118   │ 0.0/6  │ ❌    │ No   │
│ FunctionGemma 270M  │ 270M   │ 0.28   │ 0.3   │ 197   │ 0.0/6  │ ❌    │ No   │
└─────────────────────┴────────┴────────┴───────┴───────┴────────┴───────┴──────┘
```
## Speed vs Capability Cliff
```
Agent Score (6 = perfect)
  5 │             ★ Qwen (10 tok/s)  ★ Gemma4 (35 tok/s)
    │
  4 │
    │
  3 │
    │
  2 │
    │
  1 │                                   ★ Bonsai (49 tok/s)
    │
  0 │  ✗ Qwopus                                    ★ LFM2.5   ★ FuncGemma
    │  (OOM)                                       (118 tok/s) (197 tok/s)
    └──┬──────────┬──────────┬──────────┬──────────┬──────────┬──
       0          10         35         50         118        197
                      Generation Speed (tok/s)

  ★ = tested   ✗ = OOM

THE CLIFF: There is a hard capability cliff just below ~4B active params.
Models under 4B active params cannot do multi-step agent tasks,
regardless of how fast they are.
```
## Task Breakdown (6 tests)
```
┌────┬─────────────────┬────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ #  │ Task            │Diff│ Qwen 9B  │Gemma4 E4B│Bonsai 8B │ LFM 1.2B │FuncG 270M│
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1  │ Wikipedia info  │ E  │ ⚠️ 1T    │ ✅ 4T    │ ⚠️ 1T    │ ❌ OOC   │ ❌ Loop  │
│ 2  │ DDG search      │ M  │ ⚠️ 6T    │ ✅ 6T    │ ❌ 16T   │ ❌ OOC   │ ❌ Loop  │
│ 3  │ HN top story    │ E  │ ✅ 2T    │ ✅ 2T    │ ⚠️ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 4  │ Cat vision (FP) │ M  │ ✅ 2T    │ ✅ 3T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 5  │ Form filling    │ M  │ ✅ 6T    │ ❌ 1T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 6  │ reCAPTCHA       │ H  │ ⚠️ 6T    │ ⚠️ 13T   │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│    │ Score           │    │ 4.5/6    │ 4.5/6    │ 1.0/6    │ 0.0/6    │ 0.0/6    │
│    │ Total time      │    │ 834s     │ 340s     │ 103s     │ ~5s      │ ~5s      │
│    │ Total tools     │    │ 23       │ 29       │ 21       │ 0        │ 0        │
└────┴─────────────────┴────┴──────────┴──────────┴──────────┴──────────┴──────────┘
T = tool calls | E = Easy | M = Medium | H = Hard
OOC = Out of Context | Loop = infinite repetition
```
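The score and tool-call totals in the table can be re-derived from the individual cells. The sketch below assumes a weighting of 1 point for ✅, 0.5 for ⚠️, and 0 for ❌; the report does not state this rule, but it is the weighting that reproduces the listed totals. The dictionary names and the `tally` helper are illustrative only.

```python
# Assumed weights: the report does not state them, but they reproduce its totals.
WEIGHTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

# (outcome, tool_calls) for tasks 1-6, transcribed from the table above.
RESULTS = {
    "Qwen 9B": [("partial", 1), ("partial", 6), ("pass", 2),
                ("pass", 2), ("pass", 6), ("partial", 6)],
    "Gemma4 E4B": [("pass", 4), ("pass", 6), ("pass", 2),
                   ("pass", 3), ("fail", 1), ("partial", 13)],
    "Bonsai 8B": [("partial", 1), ("fail", 16), ("partial", 1),
                  ("fail", 1), ("fail", 1), ("fail", 1)],
}


def tally(rows):
    """Return (score, total_tool_calls) for one model's six task results."""
    score = sum(WEIGHTS[outcome] for outcome, _ in rows)
    tools = sum(calls for _, calls in rows)
    return score, tools


SUMMARY = {model: tally(rows) for model, rows in RESULTS.items()}

for model, (score, tools) in SUMMARY.items():
    print(f"{model}: {score}/6, {tools} tool calls")
```

LFM 1.2B and FuncG 270M are left out because every cell is a failure with no counted tool calls, so both total 0.0/6 and 0 tools.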
## Why Small Models Fail at Agent Tasks
```
┌─────────────────────────────────────────────────────────────────────┐
│ THE AGENT CAPABILITY LADDER                                         │
│                                                                     │
│ Level 1: FORMAT A TOOL CALL                                         │
│ ├─ Can generate {"name": "func", "arguments": {...}}                │
│ ├─ ALL models pass this (even 270M at 197 tok/s)                    │
│ └─ This is what BFCL benchmarks measure                             │
│                                                                     │
│ Level 2: UNDERSTAND TOOL RESULTS                                    │
│ ├─ Read tool output and decide next action                          │
│ ├─ Requires: context understanding, error handling                  │
│ └─ Bonsai-8B fails here (makes 1 call then stops)                   │
│                                                                     │
│ Level 3: CHAIN MULTIPLE TOOLS                                       │
│ ├─ Navigate → type → click → extract → report                       │
│ ├─ Requires: planning, sequential reasoning, 5K+ context            │
│ └─ Minimum ~4B active params (Gemma4 E4B)                           │
│                                                                     │
│ Level 4: HANDLE ERRORS AND ADAPT                                    │
│ ├─ Tool fails → try different approach → recover                    │
│ ├─ Requires: robust reasoning, error patterns                       │
│ └─ Qwen 9B reliable, Gemma4 E4B partial                             │
│                                                                     │
│ Level 5: COMPLEX MULTI-FIELD INTERACTION                            │
│ ├─ Fill forms, interact with dynamic UIs                            │
│ ├─ Requires: deep context, field mapping, DOM understanding         │
│ ├─ Only Qwen 9B succeeds (form filling)                             │
│ └─ Likely needs 27B+ for consistent success                         │
│                                                                     │
│ BFCL = Level 1 only. Our tests = Levels 1-5.                        │
│ That's why Bonsai scores 73% on BFCL but 1/6 on our tests.          │
└─────────────────────────────────────────────────────────────────────┘
```
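To make the Level 1 bar concrete, here is a minimal sketch of a check that only tests whether a model's output has the `{"name": ..., "arguments": {...}}` shape. The function name and the `navigate` tool in the usage lines are hypothetical and not taken from the test harness. Passing this check says nothing about Levels 2-5.

```python
import json


def is_well_formed_tool_call(text: str) -> bool:
    """Level 1 only: is `text` a JSON object with a string "name"
    and an object "arguments"? Says nothing about whether the call is useful."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )


# Hypothetical outputs, for illustration only.
good = '{"name": "navigate", "arguments": {"url": "https://example.com"}}'
bad = 'Sure! I will call navigate("https://example.com") now.'

print(is_well_formed_tool_call(good))
print(is_well_formed_tool_call(bad))
```

A model that repeats the same well-formed call forever would still pass this check on every turn, which is one way a Level 1 pass can coexist with a 0.0/6 agent score.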
## Memory Map on 16GB
```
Available: 16 GB

Qwopus 27B Q3:
  ████████████████████████████████████████████████████████ 14+ GB ❌ OOM
  ████ OS (3GB)

Qwen 9B + Falcon:
  ██████████████████████████ 6.5 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ████ OS (3GB)
  ░░░░░ 4.2 GB free ✅

Gemma4 E4B + Falcon:
  ██████████████████████ 5.4 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser+Proxy
  ████ OS (3GB)
  ░░░░░░░ 5.2 GB free ✅ (most headroom)

Bonsai 8B + Qwen 9B (dual!):
  █████ 1.5 GB Bonsai
  ██████████████████████████ 6.5 GB Qwen
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ████ OS (3GB)
  ░░ 2.7 GB free ✅ (tight but fits!)
```
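The free-memory figures follow from subtracting each listed component from the 16 GB total. This sketch redoes that arithmetic with the numbers in the map; the helper name `free_gb` is illustrative. For the Gemma4 configuration the listed components leave 5.3 GB rather than the 5.2 GB shown, which may reflect rounding or proxy overhead that the map does not itemize.

```python
TOTAL_GB = 16.0  # unified memory on the test machine
OS_GB = 3.0      # OS reservation used in the memory map


def free_gb(*components_gb: float) -> float:
    """Memory left after the OS and the listed components, rounded to 0.1 GB."""
    return round(TOTAL_GB - OS_GB - sum(components_gb), 1)


# Component sizes (GB) as listed in the memory map.
qwen = free_gb(6.5, 1.5, 0.8)        # Qwen 9B + Falcon + GUA/Browser
gemma = free_gb(5.4, 1.5, 0.8)       # Gemma4 E4B + Falcon + GUA/Browser/Proxy
dual = free_gb(1.5, 6.5, 1.5, 0.8)   # Bonsai + Qwen + Falcon + GUA/Browser
qwopus = free_gb(14.0)               # "14+" treated as a 14 GB lower bound

for name, value in [("Qwen", qwen), ("Gemma4", gemma),
                    ("Bonsai+Qwen", dual), ("Qwopus", qwopus)]:
    status = "fits" if value > 0 else "OOM"
    print(f"{name}: {value} GB free ({status})")
```

A negative result for Qwopus means the model alone exceeds what remains after the OS reservation, which matches the OOM entry.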
## Final Verdict
```
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│ 🏆 BEST OVERALL: Gemma4 E4B (with proxy fixes)                     │
│    • 4.5/6 score, 35 tok/s, 5.4 GB                                 │
│    • Best speed-to-capability ratio                                │
│    • Wins DDG search, Wikipedia, HN, vision tasks                  │
│    • Needs proxy (7 fixes) but works reliably                      │
│                                                                    │
│ 🥈 MOST RELIABLE: Qwen3.5-9B                                       │
│    • 4.5/6 score, 10 tok/s, 6.5 GB                                 │
│    • ONLY model that fills forms successfully                      │
│    • No proxy needed, native tool calling                          │
│    • Slower but handles edge cases better                          │
│                                                                    │
│ 🥉 HONORABLE: Bonsai-8B                                            │
│    • 1.0/6 but only 1.15 GB, so it fits alongside any other model  │
│    • Could serve as fast first-call router                         │
│                                                                    │
│ ❌ DON'T USE FOR AGENTS:                                           │
│    • LFM2.5-Nova (4K context too small)                            │
│    • FunctionGemma (loops infinitely)                              │
│    • Qwopus-27B (doesn't fit 16GB)                                 │
│                                                                    │
│ MINIMUM FOR MULTI-STEP AGENTS: ~4B active parameters               │
│ BFCL SCORE ≠ AGENT CAPABILITY                                      │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
---
*7 models tested on identical GUA_Blazor agent loop with Falcon Perception v2.*
*All tests: navigate, search, extract, vision detect, form fill, captcha solve.*
*Mac Mini M4 16GB, April 2026.*