# Final 7-Model Comparison — Agent Task Performance on Mac Mini M4 16GB

**Date:** 2026-04-07 | **Hardware:** Mac Mini M4 16GB (Dyson)

---

## Results At a Glance

```
Model                Size    Speed    Agent   Fit    Verdict
                     (GB)    (tok/s)  Score   16GB?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Qwopus3.5-27B Q3     11.0    OOM      N/A     ❌     Too big — needs 24GB+
Qwen3.5-9B Q4        5.6     10       4.5/6   ✅     Most reliable agent
Gemma4 E4B 4bit      5.0     35       4.5/6   ✅     Fastest viable agent
Bonsai-8B 1bit       1.15    49       1.0/6   ✅     Single-turn only
LFM2.5-Nova 1.2B     0.70    118      0.0/6   ✅     Context too small (4K)
FunctionGemma 270M   0.28    197      0.0/6   ✅     Repeats infinitely
Qwopus3.5-27B Q3     11.0    OOM      N/A     ❌     OOM at any context
```

## Detailed Comparison

```
┌─────────────────────┬────────┬────────┬───────┬───────┬────────┬───────┬──────┐
│ Model               │Params  │Disk    │Memory │Gen    │Agent   │Multi  │Proxy │
│                     │(active)│(GB)    │(GB)   │tok/s  │Score   │-step? │Need? │
├─────────────────────┼────────┼────────┼───────┼───────┼────────┼───────┼──────┤
│ Qwopus3.5-27B v3 Q3 │ 27B    │ 11.0   │ 14+   │ OOM   │ N/A    │ ?     │ No   │
│ Qwen3.5-9B Q4_K_XL  │ 9B     │ 5.6    │ 6.5   │ 10    │ 4.5/6  │ ✅    │ No   │
│ Gemma4 E4B 4bit     │ 4B MoE │ 5.0    │ 5.4   │ 35    │ 4.5/6  │ ✅    │ Yes  │
│ Bonsai-8B Q1_0      │ 8B     │ 1.15   │ 1.5   │ 49    │ 1.0/6  │ ❌    │ No   │
│ LFM2.5-Nova 1.2B Q4 │ 1.2B   │ 0.70   │ 0.8   │ 118   │ 0.0/6  │ ❌    │ No   │
│ FunctionGemma 270M  │ 270M   │ 0.28   │ 0.3   │ 197   │ 0.0/6  │ ❌    │ No   │
└─────────────────────┴────────┴────────┴───────┴───────┴────────┴───────┴──────┘
```

## Speed vs Capability Cliff

```
Agent Score (6 = perfect)
  5 │    ★ Qwen (10 tok/s)    ★ Gemma4 (35 tok/s)
    │
  4 │
    │
  3 │
    │
  2 │
    │
  1 │                                   ★ Bonsai (49 tok/s)
    │
  0 │  ✕ Qwopus                                    ★ LFM2.5     ★ FuncGemma
    │  (OOM)                                       (118 tok/s)  (197 tok/s)
    └──┬──────────┬──────────┬──────────┬──────────┬──────────┬───
       0         10         35         50        118        197
                      Generation Speed (tok/s)

★ = tested   ✕ = OOM

THE CLIFF: Capability drops off sharply below ~4B active params.
In these tests, no model under that threshold completed a multi-step
agent task, regardless of generation speed.
```

## Task Breakdown (6 tests)

```
┌────┬─────────────────┬────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ #  │ Task            │Diff│ Qwen 9B  │Gemma4 E4B│Bonsai 8B │ LFM 1.2B │FuncG 270M│
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1  │ Wikipedia info  │ E  │ ⚠️ 1T    │ ✅ 4T    │ ⚠️ 1T    │ ❌ OOC   │ ❌ Loop  │
│ 2  │ DDG search      │ M  │ ⚠️ 6T    │ ✅ 6T    │ ❌ 16T   │ ❌ OOC   │ ❌ Loop  │
│ 3  │ HN top story    │ E  │ ✅ 2T    │ ✅ 2T    │ ⚠️ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 4  │ Cat vision (FP) │ M  │ ✅ 2T    │ ✅ 3T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 5  │ Form filling    │ M  │ ✅ 6T    │ ❌ 1T    │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
│ 6  │ reCAPTCHA       │ H  │ ⚠️ 6T    │ ⚠️ 13T   │ ❌ 1T    │ ❌ 0T    │ ❌ Loop  │
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│    │ Score           │    │ 4.5/6    │ 4.5/6    │ 1.0/6    │ 0.0/6    │ 0.0/6    │
│    │ Total time      │    │ 834s     │ 340s     │ 103s     │ ~5s      │ ~5s      │
│    │ Total tools     │    │ 23       │ 29       │ 21       │ 0        │ 0        │
└────┴─────────────────┴────┴──────────┴──────────┴──────────┴──────────┴──────────┘

T = tool calls | E = Easy | M = Medium | H = Hard
OOC = Out of Context | Loop = infinite repetition
```

## Why Small Models Fail at Agent Tasks

```
┌─────────────────────────────────────────────────────────────────────┐
│                     THE AGENT CAPABILITY LADDER                     │
│                                                                     │
│  Level 1: FORMAT A TOOL CALL                                        │
│  └─ Can generate {"name": "func", "arguments": {...}}               │
│  └─ ALL models pass this (even 270M at 197 tok/s)                   │
│  └─ This is what BFCL benchmarks measure                            │
│                                                                     │
│  Level 2: UNDERSTAND TOOL RESULTS                                   │
│  └─ Read tool output and decide next action                         │
│  └─ Requires: context understanding, error handling                 │
│  └─ Bonsai-8B fails here (makes 1 call then stops)                  │
│                                                                     │
│  Level 3: CHAIN MULTIPLE TOOLS                                      │
│  └─ Navigate → type → click → extract → report                      │
│  └─ Requires: planning, sequential reasoning, 5K+ context           │
│  └─ Minimum ~4B active params (Gemma4 E4B)                          │
│                                                                     │
│  Level 4: HANDLE ERRORS AND ADAPT                                   │
│  └─ Tool fails → try different approach → recover                   │
│  └─ Requires: robust reasoning, error patterns                      │
│  └─ Qwen 9B reliable, Gemma4 E4B partial                            │
│                                                                     │
│  Level 5: COMPLEX MULTI-FIELD INTERACTION                           │
│  └─ Fill forms, interact with dynamic UIs                           │
│  └─ Requires: deep context, field mapping, DOM understanding        │
│  └─ Only Qwen 9B succeeds (form filling)                            │
│  └─ Likely needs 27B+ for consistent success                        │
│                                                                     │
│  BFCL = Level 1 only. Our tests = Levels 1-5.                       │
│  That's why Bonsai scores 73% on BFCL but 1/6 on our tests.         │
└─────────────────────────────────────────────────────────────────────┘
```

## Memory Map on 16GB

```
Available: 16 GB

Qwopus 27B Q3:
  ████████████████████████████████████████████████████████████████ 14+ GB → ❌ OOM
  ▓▓▓▓ OS (3GB)

Qwen 9B + Falcon:
  ████████████████████████████ 6.5 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ▓▓▓▓ OS (3GB)
  ░░░░░ 4.2 GB free ✅

Gemma4 E4B + Falcon:
  ██████████████████████ 5.4 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser+Proxy
  ▓▓▓▓ OS (3GB)
  ░░░░░░░ 5.3 GB free ✅ (most headroom)

Bonsai 8B + Qwen 9B (dual!):
  █████ 1.5 GB Bonsai
  ████████████████████████████ 6.5 GB Qwen
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ▓▓▓▓ OS (3GB)
  ░░ 2.7 GB free ✅ (tight but fits!)
```

## Final Verdict

```
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  🏆 BEST OVERALL: Gemma4 E4B (with proxy fixes)                    │
│     • 4.5/6 score, 35 tok/s, 5.4 GB                                │
│     • Best speed-to-capability ratio                               │
│     • Passes DDG search, Wikipedia, HN, vision tasks               │
│     • Needs proxy (7 fixes) but works reliably                     │
│                                                                    │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                      │
│     • 4.5/6 score, 10 tok/s, 6.5 GB                                │
│     • ONLY model that fills forms successfully                     │
│     • No proxy needed, native tool calling                         │
│     • Slower but handles edge cases better                         │
│                                                                    │
│  🥉 HONORABLE: Bonsai-8B                                           │
│     • 1.0/6 but only 1.15 GB — fits alongside any other model      │
│     • Could serve as fast first-call router                        │
│                                                                    │
│  ❌ DON'T USE FOR AGENTS:                                          │
│     • LFM2.5-Nova (4K context too small)                           │
│     • FunctionGemma (loops infinitely)                             │
│     • Qwopus-27B (doesn't fit in 16GB)                             │
│                                                                    │
│  MINIMUM FOR MULTI-STEP AGENTS: ~4B active parameters              │
│  BFCL SCORE ≠ AGENT CAPABILITY                                     │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

---

*7 models tested on an identical GUA_Blazor agent loop with Falcon Perception v2.*
*All tests: navigate, search, extract, vision detect, form fill, captcha solve.*
*Mac Mini M4 16GB, April 2026.*
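
The score row in the Task Breakdown table is consistent with a simple rubric: a clean pass (✅) counts 1 point, a partial (⚠️) counts 0.5, and a failure (❌) counts 0. The report does not state the rubric, so the weights below are an assumption inferred from the table, and the names `POINTS`, `RESULTS`, and `agent_score` are illustrative only.

```python
# ASSUMPTION: weights inferred from the Task Breakdown table, not stated in the report.
POINTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

# Per-task marks in table order (tasks 1-6): pass = ✅, partial = ⚠️, fail = ❌.
RESULTS = {
    "Qwen 9B": ["partial", "partial", "pass", "pass", "pass", "partial"],
    "Gemma4 E4B": ["pass", "pass", "pass", "pass", "fail", "partial"],
    "Bonsai 8B": ["partial", "fail", "partial", "fail", "fail", "fail"],
    "LFM 1.2B": ["fail"] * 6,
    "FuncG 270M": ["fail"] * 6,
}


def agent_score(marks):
    """Sum the points for one model's list of per-task marks."""
    return sum(POINTS[mark] for mark in marks)


SCORES = {model: agent_score(marks) for model, marks in RESULTS.items()}

if __name__ == "__main__":
    for model, score in SCORES.items():
        print(f"{model}: {score}/6")
```

Under this rubric, two models can tie on score while failing different tasks, which is why the verdict separates "best overall" from "most reliable".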
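
The Memory Map reduces to one subtraction per configuration: total RAM, minus the OS reservation, minus every resident component. The sketch below uses the component sizes from the map. The function name `headroom_gb` is hypothetical, and treating Qwopus's "14+ GB" as exactly 14 GB is an assumption that understates its real footprint.

```python
TOTAL_GB = 16.0  # Mac Mini M4 unified memory
OS_GB = 3.0      # OS reservation used in the Memory Map


def headroom_gb(*component_gb, total_gb=TOTAL_GB, os_gb=OS_GB):
    """Return free memory in GB after the OS and all components are loaded.

    A negative result means the configuration does not fit.
    """
    return round(total_gb - os_gb - sum(component_gb), 1)


# Component sizes (GB) as listed in the Memory Map.
CONFIGS = {
    "Qwen 9B + Falcon": (6.5, 1.5, 0.8),
    "Gemma4 E4B + Falcon": (5.4, 1.5, 0.8),
    "Bonsai 8B + Qwen 9B": (1.5, 6.5, 1.5, 0.8),
    "Qwopus 27B Q3": (14.0,),  # ASSUMPTION: "14+ GB" taken as a 14 GB lower bound
}

if __name__ == "__main__":
    for name, parts in CONFIGS.items():
        free = headroom_gb(*parts)
        status = "fits" if free >= 0 else "OOM"
        print(f"{name}: {free} GB free ({status})")
```

The same check can be run before trying a new model pairing: pass in the resident sizes and reject any combination with negative headroom.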