Final 7-Model Comparison — Agent Task Performance on Mac Mini M4 16GB

Date: 2026-04-07 | Hardware: Mac Mini M4 16GB (Dyson)

Results At a Glance

Model               Size    Speed    Agent    Fit     Verdict
                    (GB)    (tok/s)  Score    16GB?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Qwopus3.5-27B Q3    11.0     OOM     N/A      ❌      Too big: OOM at any context, needs 24GB+
Qwen3.5-9B Q4       5.6      10     4.5/6     ✅      Most reliable agent
Gemma4 E4B 4bit     5.0      35     4.5/6     ✅      Fastest viable agent
Bonsai-8B 1bit      1.15     49     1.0/6     ✅      Single-turn only
LFM2.5-Nova 1.2B    0.70    118     0.0/6     ✅      Context too small (4K)
FunctionGemma 270M  0.28    197     0.0/6     ✅      Repeats infinitely

Detailed Comparison

┌─────────────────────┬────────┬────────┬───────┬───────┬────────┬───────┬──────┐
│ Model               │Params  │Disk    │Memory │Gen    │Agent   │Multi  │Proxy │
│                     │(active)│(GB)    │(GB)   │tok/s  │Score   │-step? │Need? │
├─────────────────────┼────────┼────────┼───────┼───────┼────────┼───────┼──────┤
│ Qwopus3.5-27B v3 Q3 │ 27B    │ 11.0   │ 14+   │ OOM   │  N/A   │  ?    │  No  │
│ Qwen3.5-9B Q4_K_XL  │ 9B     │  5.6   │  6.5  │  10   │ 4.5/6  │  ✅   │  No  │
│ Gemma4 E4B 4bit     │ 4B MoE │  5.0   │  5.4  │  35   │ 4.5/6  │  ✅   │  Yes │
│ Bonsai-8B Q1_0      │ 8B     │  1.15  │  1.5  │  49   │ 1.0/6  │  ❌   │  No  │
│ LFM2.5-Nova 1.2B Q4 │ 1.2B   │  0.70  │  0.8  │ 118   │ 0.0/6  │  ❌   │  No  │
│ FunctionGemma 270M  │ 270M   │  0.28  │  0.3  │ 197   │ 0.0/6  │  ❌   │  No  │
└─────────────────────┴────────┴────────┴───────┴───────┴────────┴───────┴──────┘

Speed vs Capability Cliff

Agent Score (6 = perfect)

 5 │          ★ Qwen (10 tok/s)    ★ Gemma4 (35 tok/s)
   │
 4 │
   │
 3 │
   │
 2 │
   │
 1 │                                    ★ Bonsai (49 tok/s)
   │
 0 │  ✕ Qwopus                              ★ LFM2.5    ★ FuncGemma
   │  (OOM)                                (118 tok/s)   (197 tok/s)
   └──┬──────────┬──────────┬──────────┬──────────┬──────────┬───
     0         10         35         50        118        197
                        Generation Speed (tok/s)

  ★ = tested   ✕ = OOM

  THE CLIFF: There is a hard capability cliff at roughly 4B active
  params. In these tests, no model below that size completed a
  multi-step agent task, however fast it ran. Size alone is not
  sufficient, though: the 1-bit Bonsai-8B also failed multi-step tasks.
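
The cliff claim can be checked against the Detailed Comparison table in a few lines of Python. This is only a sketch: the tuples are transcribed from that table (Qwopus is left out because it never ran), and the 4.0 threshold is the approximate figure quoted above, not a measured boundary.

```python
# (name, active params in billions, generation tok/s, multi-step capable?)
# Values transcribed from the Detailed Comparison table.
MODELS = [
    ("Qwen3.5-9B Q4_K_XL", 9.0, 10, True),
    ("Gemma4 E4B 4bit", 4.0, 35, True),
    ("Bonsai-8B Q1_0", 8.0, 49, False),
    ("LFM2.5-Nova 1.2B Q4", 1.2, 118, False),
    ("FunctionGemma 270M", 0.27, 197, False),
]

CLIFF_B = 4.0  # approximate threshold from the text (an assumption)


def below_cliff(models, cliff=CLIFF_B):
    """Models with fewer active params than the cliff."""
    return [m for m in models if m[1] < cliff]


def exceptions_above_cliff(models, cliff=CLIFF_B):
    """Names of models at or above the cliff that still failed multi-step tasks."""
    return [m[0] for m in models if m[1] >= cliff and not m[3]]


if __name__ == "__main__":
    for name, params_b, tok_s, multi in below_cliff(MODELS):
        print(f"{name}: {params_b}B, {tok_s} tok/s, multi-step={multi}")
    print("Above the cliff but failing:", exceptions_above_cliff(MODELS))
```

Speed does not rescue the models below the threshold: the two fastest entries are both there, and neither is multi-step capable.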

Task Breakdown (6 tests)

┌────┬─────────────────┬────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│  # │ Task            │Diff│ Qwen 9B  │Gemma4 E4B│Bonsai 8B │ LFM 1.2B │FuncG 270M│
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│  1 │ Wikipedia info   │ E  │  ⚠️ 1T   │  ✅ 4T   │  ⚠️ 1T   │  ❌ OOC  │  ❌ Loop  │
│  2 │ DDG search       │ M  │  ⚠️ 6T   │  ✅ 6T   │  ❌ 16T  │  ❌ OOC  │  ❌ Loop  │
│  3 │ HN top story     │ E  │  ✅ 2T   │  ✅ 2T   │  ⚠️ 1T   │  ❌ 0T   │  ❌ Loop  │
│  4 │ Cat vision (FP)  │ M  │  ✅ 2T   │  ✅ 3T   │  ❌ 1T   │  ❌ 0T   │  ❌ Loop  │
│  5 │ Form filling     │ M  │  ✅ 6T   │  ❌ 1T   │  ❌ 1T   │  ❌ 0T   │  ❌ Loop  │
│  6 │ reCAPTCHA        │ H  │  ⚠️ 6T   │  ⚠️ 13T  │  ❌ 1T   │  ❌ 0T   │  ❌ Loop  │
├────┼─────────────────┼────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│    │ Score           │    │ 4.5/6    │ 4.5/6    │ 1.0/6    │ 0.0/6    │ 0.0/6    │
│    │ Total time      │    │ 834s     │ 340s     │ 103s     │ ~5s      │ ~5s      │
│    │ Total tools     │    │ 23       │ 29       │ 21       │ 0        │ 0        │
└────┴─────────────────┴────┴──────────┴──────────┴──────────┴──────────┴──────────┘

T = tool calls | E = Easy | M = Medium | H = Hard
✅ = pass (1 point) | ⚠️ = partial (0.5) | ❌ = fail (0)
OOC = Out of Context | Loop = infinite repetition
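
The score and tool-call totals in the table can be reproduced from the per-task cells. The sketch below assumes a pass is worth 1 point, a partial 0.5, and a fail 0; that mapping is inferred from the totals rather than stated by the test harness. The seconds-per-tool-call figure is a derived number, not one the tests report directly.

```python
# Points per outcome. Assumption: inferred from the table totals.
POINTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

# (outcome, tool calls) for tasks 1..6, transcribed from the Task Breakdown table.
RESULTS = {
    "Qwen 9B": [("partial", 1), ("partial", 6), ("pass", 2),
                ("pass", 2), ("pass", 6), ("partial", 6)],
    "Gemma4 E4B": [("pass", 4), ("pass", 6), ("pass", 2),
                   ("pass", 3), ("fail", 1), ("partial", 13)],
    "Bonsai 8B": [("partial", 1), ("fail", 16), ("partial", 1),
                  ("fail", 1), ("fail", 1), ("fail", 1)],
}

# Total wall-clock time per model in seconds, from the same table.
TOTAL_TIME_S = {"Qwen 9B": 834, "Gemma4 E4B": 340, "Bonsai 8B": 103}


def summarize(model):
    """Return (score, total tool calls, seconds per tool call) for one model."""
    rows = RESULTS[model]
    score = sum(POINTS[outcome] for outcome, _ in rows)
    tools = sum(calls for _, calls in rows)
    return score, tools, TOTAL_TIME_S[model] / tools


if __name__ == "__main__":
    for model in RESULTS:
        score, tools, sec_per_call = summarize(model)
        print(f"{model}: {score}/6, {tools} tool calls, {sec_per_call:.1f} s/call")
```

LFM 1.2B and FuncG 270M are omitted because they made no tool calls, so a per-call time is undefined for them.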

Why Small Models Fail at Agent Tasks

┌─────────────────────────────────────────────────────────────────────┐
│                    THE AGENT CAPABILITY LADDER                       │
│                                                                     │
│  Level 1: FORMAT A TOOL CALL                                        │
│  └─ Can generate {"name": "func", "arguments": {...}}              │
│  └─ ALL models pass this (even 270M at 197 tok/s)                  │
│  └─ This is what BFCL benchmarks measure                           │
│                                                                     │
│  Level 2: UNDERSTAND TOOL RESULTS                                   │
│  └─ Read tool output and decide next action                        │
│  └─ Requires: context understanding, error handling                │
│  └─ Bonsai-8B fails here (makes 1 call then stops)                │
│                                                                     │
│  Level 3: CHAIN MULTIPLE TOOLS                                      │
│  └─ Navigate → type → click → extract → report                    │
│  └─ Requires: planning, sequential reasoning, 5K+ context         │
│  └─ Minimum ~4B active params (Gemma4 E4B)                        │
│                                                                     │
│  Level 4: HANDLE ERRORS AND ADAPT                                   │
│  └─ Tool fails → try different approach → recover                  │
│  └─ Requires: robust reasoning, error patterns                     │
│  └─ Qwen 9B reliable, Gemma4 E4B partial                          │
│                                                                     │
│  Level 5: COMPLEX MULTI-FIELD INTERACTION                           │
│  └─ Fill forms, interact with dynamic UIs                          │
│  └─ Requires: deep context, field mapping, DOM understanding       │
│  └─ Only Qwen 9B succeeds (form filling)                          │
│  └─ Likely needs 27B+ for consistent success                       │
│                                                                     │
│  BFCL = Level 1 only. Our tests = Levels 1-5.                     │
│  That's why Bonsai scores 73% on BFCL but 1/6 on our tests.       │
└─────────────────────────────────────────────────────────────────────┘
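
To make the ladder concrete, here is a minimal agent loop in Python. It is a sketch under stated assumptions, not the GUA_Blazor loop used in these tests: `run_agent`, `parse_tool_call`, the two stub models, and the `TOOLS` registry are all hypothetical names invented for illustration. Only the `{"name": ..., "arguments": {...}}` call shape comes from the ladder itself.

```python
import json


def parse_tool_call(text):
    """Level 1: return (name, arguments) if text is a well-formed tool call, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name, args = obj.get("name"), obj.get("arguments")
    if isinstance(name, str) and isinstance(args, dict):
        return name, args
    return None


def run_agent(model, tools, task, max_turns=8):
    """Drive the model until it answers in plain text, repeats itself, or runs out of turns."""
    history = [("user", task)]
    previous_call = None
    for _ in range(max_turns):
        reply = model(history)
        call = parse_tool_call(reply)
        if call is None:
            return "answer", reply            # plain text is treated as the final answer
        if call == previous_call:
            return "loop", None               # same call twice in a row: give up
        previous_call = call
        name, args = call
        try:
            result = tools[name](**args)      # Levels 2-3: result goes back into context
        except Exception as exc:              # Level 4: the model sees the error text
            result = f"ERROR: {exc}"
        history.append(("assistant", reply))
        history.append(("tool", str(result)))
    return "max_turns", None


# Hypothetical tools and stub "models" for demonstration only.
TOOLS = {
    "navigate": lambda url: f"loaded {url}",
    "extract": lambda selector: "Example Domain",
}


def scripted_model(history):
    """Chains two tools, then answers using the last tool result."""
    results = [content for role, content in history if role == "tool"]
    if len(results) == 0:
        return json.dumps({"name": "navigate", "arguments": {"url": "https://example.com"}})
    if len(results) == 1:
        return json.dumps({"name": "extract", "arguments": {"selector": "h1"}})
    return f"The page heading is: {results[-1]}"


def looping_model(history):
    """Emits the same valid tool call forever, ignoring tool results."""
    return json.dumps({"name": "navigate", "arguments": {"url": "https://example.com"}})


if __name__ == "__main__":
    print(run_agent(scripted_model, TOOLS, "Report the page heading"))
    print(run_agent(looping_model, TOOLS, "Report the page heading"))
```

Both stubs pass Level 1, since every reply they emit as a call parses cleanly. Only the first one reads tool results and moves on, which is the distinction a format-only benchmark cannot see.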

Memory Map on 16GB

Available: 16 GB

Qwopus 27B Q3:
  ████████████████████████████████████████████████████████████████ 14+ GB → ❌ OOM
  ▓▓▓▓ OS (3GB)

Qwen 9B + Falcon:
  ████████████████████████████ 6.5 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ▓▓▓▓ OS (3GB)
  ░░░░░ 4.2 GB free ✅

Gemma4 E4B + Falcon:
  ██████████████████████ 5.4 GB model
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser+Proxy
  ▓▓▓▓ OS (3GB)
  ░░░░░░░ 5.3 GB free ✅ (most headroom)

Bonsai 8B + Qwen 9B (dual!):
  █████ 1.5 GB Bonsai
  ████████████████████████████ 6.5 GB Qwen
  ██████ 1.5 GB Falcon
  ██ 0.8 GB GUA+Browser
  ▓▓▓▓ OS (3GB)
  ░░ 2.7 GB free ✅ (tight but fits!)
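
The free-memory figures follow from simple subtraction, sketched below in Python. All component sizes are taken from the map above; the 14.0 GB entry for Qwopus is the lower bound of the "14+" figure, so its real shortfall is at least what the code reports.

```python
TOTAL_GB = 16.0  # unified memory on the Mac Mini M4
OS_GB = 3.0      # OS reservation shown in the memory map

# Component sizes in GB, transcribed from the memory map.
CONFIGS = {
    "Qwopus 27B Q3": {"Qwopus 27B": 14.0},  # lower bound of "14+"
    "Qwen 9B + Falcon": {"Qwen 9B": 6.5, "Falcon": 1.5, "GUA+Browser": 0.8},
    "Gemma4 E4B + Falcon": {"Gemma4 E4B": 5.4, "Falcon": 1.5,
                            "GUA+Browser+Proxy": 0.8},
    "Bonsai 8B + Qwen 9B": {"Bonsai 8B": 1.5, "Qwen 9B": 6.5,
                            "Falcon": 1.5, "GUA+Browser": 0.8},
}


def free_gb(components, total=TOTAL_GB, os_gb=OS_GB):
    """Memory left after the OS and all listed components, rounded to 0.1 GB."""
    return round(total - os_gb - sum(components.values()), 1)


def fits(components):
    """True when the configuration leaves non-negative headroom."""
    return free_gb(components) >= 0


if __name__ == "__main__":
    for name, components in CONFIGS.items():
        status = "fits" if fits(components) else "OOM"
        print(f"{name}: {free_gb(components)} GB free ({status})")
```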

Final Verdict

┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  🏆 BEST OVERALL: Gemma4 E4B (with proxy fixes)                  │
│     • 4.5/6 score, 35 tok/s, 5.4 GB                              │
│     • Best speed-to-capability ratio                               │
│     • Passes DDG search, Wikipedia, HN, vision tasks              │
│     • Needs proxy (7 fixes) but works reliably                    │
│                                                                    │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                    │
│     • 4.5/6 score, 10 tok/s, 6.5 GB                              │
│     • ONLY model that fills forms successfully                    │
│     • No proxy needed, native tool calling                         │
│     • Slower but handles edge cases better                        │
│                                                                    │
│  🥉 HONORABLE: Bonsai-8B                                          │
│     • 1.0/6 but only 1.15 GB — fits alongside any other model    │
│     • Could serve as fast first-call router                       │
│                                                                    │
│  ❌ DON'T USE FOR AGENTS:                                         │
│     • LFM2.5-Nova (4K context too small)                          │
│     • FunctionGemma (loops infinitely)                             │
│     • Qwopus-27B (doesn't fit 16GB)                               │
│                                                                    │
│  MINIMUM FOR MULTI-STEP AGENTS: ~4B active parameters             │
│  BFCL SCORE ≠ AGENT CAPABILITY                                    │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
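
The ordering of the top two entries can be reproduced with a simple ranking rule, sketched below in Python. The rule (highest agent score first, generation speed as tie-breaker) is an assumption chosen to match the verdict, not a documented selection procedure. The 10.7 GB model budget is likewise an assumption: 16 GB minus the 3 GB OS, 1.5 GB Falcon, and 0.8 GB GUA+Browser figures from the memory map.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    memory_gb: float
    tok_s: Optional[float]   # None = never ran (OOM)
    score: Optional[float]   # agent score out of 6, None = not scored


# Figures transcribed from the Detailed Comparison table.
MODELS = [
    Model("Qwopus3.5-27B", 14.0, None, None),  # "14+" GB, lower bound
    Model("Qwen3.5-9B", 6.5, 10, 4.5),
    Model("Gemma4 E4B", 5.4, 35, 4.5),
    Model("Bonsai-8B", 1.5, 49, 1.0),
    Model("LFM2.5-Nova 1.2B", 0.8, 118, 0.0),
    Model("FunctionGemma 270M", 0.3, 197, 0.0),
]


def rank(models, budget_gb):
    """Scored models that fit the budget, best score first, faster first on ties."""
    viable = [m for m in models
              if m.score is not None and m.memory_gb <= budget_gb]
    return sorted(viable, key=lambda m: (-m.score, -m.tok_s))


if __name__ == "__main__":
    for position, m in enumerate(rank(MODELS, budget_gb=10.7), start=1):
        print(f"{position}. {m.name}: {m.score}/6 at {m.tok_s} tok/s")
```

A score-and-speed rule cannot express task-specific needs. If form filling is a requirement, Qwen3.5-9B is the only model that passed that task, whatever the ranking says.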

7 models tested on an identical GUA_Blazor agent loop with Falcon Perception v2. Tasks covered: navigate, search, extract, vision detect, form fill, CAPTCHA solve. Mac Mini M4 16GB, April 2026.