Three-Model Comparison: Qwen3.5-9B vs Gemma4 E4B vs Bonsai-8B

Date: 2026-04-07 | Hardware: Mac Mini M4 16GB | Vision: Falcon Perception v2

Test Results

┌────┬───────────────────┬────────┬─────────────────┬─────────────────┬─────────────────┐
│    │                   │        │   Qwen3.5-9B    │   Gemma4 E4B    │   Bonsai-8B     │
│  # │ Task              │ Diff.  │ llama.cpp       │ mlx_vlm+proxy   │ prism llama.cpp │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│  1 │ Wikipedia extract  │ Easy   │  51s  1T  ⚠️    │  30s  4T  ✅    │  33s  1T  ⚠️    │
│  2 │ DDG search         │ Medium │ 180s  6T  ⚠️    │  60s  6T  ✅    │  53s 16T  ❌    │
│  3 │ HN top story       │ Easy   │  32s  2T  ✅    │  30s  2T  ✅    │   4s  1T  ⚠️    │
│  4 │ Cat vision (FP)    │ Medium │  34s  2T  ✅    │  40s  3T  ✅    │   5s  1T  ❌    │
│  5 │ Form filling       │ Medium │ 237s  6T  ✅    │  60s  1T  ❌    │   4s  1T  ❌    │
│  6 │ reCAPTCHA          │ Hard   │ 300s  6T  ⚠️    │ 120s 13T  ⚠️    │   4s  1T  ❌    │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│    │ PASS/PARTIAL/FAIL  │        │  3✅ 3⚠️ 0❌    │  4✅ 1⚠️ 1❌    │  0✅ 2⚠️ 4❌    │
│    │ Total time         │        │     834s        │     340s        │     103s        │
│    │ Total tool calls   │        │      23         │      29         │      21         │
└────┴───────────────────┴────────┴─────────────────┴─────────────────┴─────────────────┘

T = tool calls | FP = Falcon Perception | ✅ = pass | ⚠️ = partial | ❌ = fail
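
The summary rows can be recomputed from the per-task rows. The sketch below is only an illustration of that arithmetic: the values are copied from the table, while the names RESULTS, SUMMARY, and summarize are hypothetical and not part of the test harness.

```python
from collections import Counter

# (seconds, tool_calls, outcome) per task T1..T6, copied from the results table.
RESULTS = {
    "Qwen3.5-9B": [(51, 1, "partial"), (180, 6, "partial"), (32, 2, "pass"),
                   (34, 2, "pass"), (237, 6, "pass"), (300, 6, "partial")],
    "Gemma4 E4B": [(30, 4, "pass"), (60, 6, "pass"), (30, 2, "pass"),
                   (40, 3, "pass"), (60, 1, "fail"), (120, 13, "partial")],
    "Bonsai-8B": [(33, 1, "partial"), (53, 16, "fail"), (4, 1, "partial"),
                  (5, 1, "fail"), (4, 1, "fail"), (4, 1, "fail")],
}


def summarize(rows):
    """Return total seconds, total tool calls, and outcome counts for one model."""
    outcomes = Counter(outcome for _, _, outcome in rows)
    return {
        "time_s": sum(seconds for seconds, _, _ in rows),
        "tool_calls": sum(calls for _, calls, _ in rows),
        "pass": outcomes["pass"],
        "partial": outcomes["partial"],
        "fail": outcomes["fail"],
    }


SUMMARY = {model: summarize(rows) for model, rows in RESULTS.items()}

for model, s in SUMMARY.items():
    print(f"{model}: {s['pass']} pass, {s['partial']} partial, {s['fail']} fail, "
          f"{s['time_s']}s, {s['tool_calls']} tool calls")
```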

Speed Comparison

Generation speed (tok/s):

Bonsai-8B    ████████████████████████████████████████████████  48.8
Gemma4 E4B   ██████████████████████████████████               35.0
Qwen3.5-9B   ██████████                                       10.0

Total suite time:

Bonsai-8B    ██████                                  103s (1.7 min)
Gemma4 E4B   █████████████████████                   340s (5.7 min)
Qwen3.5-9B   █████████████████████████████████████████████████████  834s (13.9 min)

Model Specs

┌──────────────────┬──────────────┬──────────────┬──────────────┐
│                  │ Qwen3.5-9B   │ Gemma4 E4B   │ Bonsai-8B    │
├──────────────────┼──────────────┼──────────────┼──────────────┤
│ Architecture     │ Dense        │ MoE (4B act) │ Dense 1-bit  │
│ Total params     │ 9B           │ 9B           │ 8B           │
│ Active params    │ 9B           │ 4B           │ 8B           │
│ Quantization     │ Q4_K_XL      │ 4-bit MLX    │ Q1_0_g128    │
│ Disk size        │ 5.6 GB       │ ~5 GB        │ 1.15 GB      │
│ Memory (peak)    │ ~6.5 GB      │ ~5.4 GB      │ ~1.5 GB      │
│ Generation tok/s │ ~10          │ ~35          │ ~49          │
│ Base model       │ Qwen3.5      │ Gemma4       │ Qwen3        │
│ Vision           │ ✅ mmproj     │ ✅ native     │ ❌ text only  │
│ Tool calling     │ ✅ native     │ ✅ patched    │ ✅ native     │
│ Proxy needed     │ ❌            │ ✅            │ ❌            │
│ Server           │ llama.cpp    │ mlx_vlm      │ PrismML fork │
└──────────────────┴──────────────┴──────────────┴──────────────┘

Task-by-Task Analysis

T1: Wikipedia Extract (Easy)

Qwen:  1 tool (scrape_url), got some info          ⚠️ Partial
Gemma: 4 tools (navigate+extract), all 3 answers   ✅ Pass
Bonsai: 1 tool (navigate), stopped too early        ⚠️ Partial

T2: DuckDuckGo Search (Medium)

Qwen:  6 tools, hit 180s timeout                   ⚠️ Partial (slow)
Gemma: 6 tools (go+type+click+scroll+extract)      ✅ Pass
Bonsai: 16 tools but 21 idle turns, confused        ❌ Fail

T3: HN Top Story (Easy)

Qwen:  2 tools, clean stop_loop                    ✅ Pass
Gemma: 2 tools, found title                        ✅ Pass
Bonsai: 1 tool, 4s but didn't extract properly     ⚠️ Partial

T4: Cat Vision with Falcon (Medium)

Qwen:  2 tools incl vision_detect                  ✅ Pass
Gemma: 3 tools, 2x vision_detect, found 6 cats     ✅ Pass
Bonsai: 1 tool, never called vision_detect          ❌ Fail

T5: Form Filling (Medium)

Qwen:  6 tools (go+input+click x3), stop_loop      ✅ Pass (only model to pass)
Gemma: 1 tool, stuck thinking about form fields     ❌ Fail
Bonsai: 1 tool, only navigated                      ❌ Fail

T6: reCAPTCHA (Hard)

Qwen:  6 tools, 3 vision, 1 batch click, timeout   ⚠️ Partial
Gemma: 13 tools, 9 vision, 5 batch clicks, timeout ⚠️ Partial (most active)
Bonsai: 1 tool, only navigated                      ❌ Fail

Scoring Matrix

┌───────────────────┬───────┬───────┬───────┐
│ Task              │ Qwen  │ Gemma │Bonsai │
├───────────────────┼───────┼───────┼───────┤
│ T1 Wikipedia      │  0.5  │  1.0  │  0.5  │
│ T2 DDG Search     │  0.5  │  1.0  │  0.0  │
│ T3 HN Story       │  1.0  │  1.0  │  0.5  │
│ T4 Cat Vision     │  1.0  │  1.0  │  0.0  │
│ T5 Form Fill      │  1.0  │  0.0  │  0.0  │
│ T6 reCAPTCHA      │  0.5  │  0.5  │  0.0  │
├───────────────────┼───────┼───────┼───────┤
│ TOTAL             │  4.5  │  4.5  │  1.0  │
│ Speed factor      │  1x   │  3.5x │  4.9x │
│ Score/minute      │  0.32 │  0.79 │  0.58 │
└───────────────────┴───────┴───────┴───────┘

Score/minute = total_score / suite_time_minutes
Speed factor = generation tok/s relative to Qwen3.5-9B (~10 tok/s)
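
As a worked example of the score/minute formula, the sketch below recomputes the last row of the matrix from the per-task scores and the suite times reported earlier. The dictionary and function names are illustrative assumptions, not part of the benchmark tooling.

```python
# Per-task scores T1..T6 (pass = 1.0, partial = 0.5, fail = 0.0), from the scoring matrix.
SCORES = {
    "Qwen": [0.5, 0.5, 1.0, 1.0, 1.0, 0.5],
    "Gemma": [1.0, 1.0, 1.0, 1.0, 0.0, 0.5],
    "Bonsai": [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
}

# Total suite time in seconds, from the test results table.
SUITE_SECONDS = {"Qwen": 834, "Gemma": 340, "Bonsai": 103}


def score_per_minute(scores, suite_seconds):
    """total_score / suite_time_minutes, rounded to two decimals."""
    return round(sum(scores) / (suite_seconds / 60), 2)


SCORE_PER_MINUTE = {
    model: score_per_minute(SCORES[model], SUITE_SECONDS[model]) for model in SCORES
}
print(SCORE_PER_MINUTE)
```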

Memory Budget on 16GB

┌──────────────────────┬──────────┬──────────┬──────────┐
│ Component            │ +Qwen    │ +Gemma4  │ +Bonsai  │
├──────────────────────┼──────────┼──────────┼──────────┤
│ LLM model            │ 6.5 GB   │ 5.4 GB   │ 1.5 GB   │
│ Falcon Perception    │ 1.5 GB   │ 1.5 GB   │ 1.5 GB   │
│ GUA_Blazor + Browser │ 0.8 GB   │ 0.8 GB   │ 0.8 GB   │
│ Proxy                │ —        │ 0.1 GB   │ —        │
│ OS + system          │ 3.0 GB   │ 3.0 GB   │ 3.0 GB   │
├──────────────────────┼──────────┼──────────┼──────────┤
│ TOTAL                │ 11.8 GB  │ 10.8 GB  │ 6.8 GB   │
│ HEADROOM             │ 4.2 GB   │ 5.2 GB   │ 9.2 GB ★ │
└──────────────────────┴──────────┴──────────┴──────────┘

Bonsai leaves 9.2 GB free, enough to run a second model alongside it (for example Qwen3.5-9B at ~6.5 GB peak).
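
The headroom figures are plain subtraction from 16 GB. The sketch below reproduces them and also checks a two-model configuration. It assumes that Falcon Perception, GUA_Blazor plus the browser, and the OS are loaded only once when two models share the machine; that assumption was not measured here. All names in the code are hypothetical.

```python
TOTAL_RAM_GB = 16.0

# Components shared by every configuration, from the memory budget table.
SHARED_GB = {"Falcon Perception": 1.5, "GUA_Blazor + Browser": 0.8, "OS + system": 3.0}

# Peak memory per LLM, plus the proxy that only Gemma4 needs.
MODEL_GB = {"Qwen": 6.5, "Gemma4": 5.4, "Bonsai": 1.5}
PROXY_GB = {"Gemma4": 0.1}


def headroom_gb(*models):
    """Free memory in GB after loading the shared components and the given models."""
    used = sum(SHARED_GB.values())
    used += sum(MODEL_GB[m] + PROXY_GB.get(m, 0.0) for m in models)
    return round(TOTAL_RAM_GB - used, 1)


for combo in [("Qwen",), ("Gemma4",), ("Bonsai",), ("Bonsai", "Qwen")]:
    print(" + ".join(combo), "->", headroom_gb(*combo), "GB free")
```

Under that sharing assumption, loading Bonsai and Qwen together still leaves a positive margin, although a smaller one than either model alone.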

Verdict

┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  🏆 BEST OVERALL: Gemma4 E4B                                  │
│     4.5 score, 3.5x speed, best score/minute (0.79)          │
│     Wins: Wikipedia, DDG search, HN, Cat vision              │
│                                                                │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                │
│     4.5 score, slowest but ONLY model that fills forms       │
│     Wins: Form filling (exclusive), HN, Cat vision           │
│                                                                │
│  🥉 FASTEST BUT WEAKEST: Bonsai-8B                           │
│     1.0 score, 4.9x speed, but can't do multi-step tasks    │
│     Only 1.15 GB on disk; could pair with another model      │
│                                                                │
│  IDEAL SETUP:                                                  │
│  Bonsai-8B (1.5 GB) + Qwen3.5-9B (6.5 GB) = 8 GB            │
│  Route simple→Bonsai, complex→Qwen                            │
│  Both fit on 16GB simultaneously!                              │
│                                                                │
│  OR: Gemma4 E4B alone (best score/minute, most versatile)     │
│                                                                │
└────────────────────────────────────────────────────────────────┘
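
The "route simple→Bonsai, complex→Qwen" idea could be sketched as below. This is a hypothetical illustration, not the GUA_Blazor implementation: the Task type, the pick_model function, and the rule that "simple" means an Easy task with no vision step are all assumptions. The vision rule follows from the specs table, where Bonsai-8B is text only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; the fields mirror columns of the results table."""
    name: str
    difficulty: str            # "Easy", "Medium", or "Hard"
    needs_vision: bool = False


def pick_model(task: Task) -> str:
    """Assumed routing rule: Easy text-only tasks go to Bonsai, everything else to Qwen."""
    if task.needs_vision:
        return "Qwen3.5-9B"    # Bonsai-8B is text only
    if task.difficulty == "Easy":
        return "Bonsai-8B"
    return "Qwen3.5-9B"


TASKS = [
    Task("HN top story", "Easy"),
    Task("Cat vision", "Medium", needs_vision=True),
    Task("Form filling", "Medium"),
]

for task in TASKS:
    print(f"{task.name}: {pick_model(task)}")
```

Since Bonsai scored only partial on both Easy tasks in this run, a real router would likely need a fallback that retries on Qwen when Bonsai stops early.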

All tests were run on a Mac Mini M4 (16GB) with the same GUA_Blazor agent loop, the same Falcon Perception v2 vision backend, and the same prompts.