# Three-Model Comparison: Qwen3.5-9B vs Gemma4 E4B vs Bonsai-8B

**Date:** 2026-04-07 | **Hardware:** Mac Mini M4 16GB | **Vision:** Falcon Perception v2

---

## Test Results

```
┌────┬───────────────────┬────────┬─────────────────┬─────────────────┬─────────────────┐
│    │                   │        │ Qwen3.5-9B      │ Gemma4 E4B      │ Bonsai-8B       │
│ #  │ Task              │ Diff.  │ llama.cpp       │ mlx_vlm+proxy   │ prism llama.cpp │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│ 1  │ Wikipedia extract │ Easy   │ 51s  1T  ⚠️     │ 30s  4T  ✅     │ 33s  1T  ⚠️     │
│ 2  │ DDG search        │ Medium │ 180s 6T  ⚠️     │ 60s  6T  ✅     │ 53s  16T ❌     │
│ 3  │ HN top story      │ Easy   │ 32s  2T  ✅     │ 30s  2T  ✅     │ 4s   1T  ⚠️     │
│ 4  │ Cat vision (FP)   │ Medium │ 34s  2T  ✅     │ 40s  3T  ✅     │ 5s   1T  ❌     │
│ 5  │ Form filling      │ Medium │ 237s 6T  ✅     │ 60s  1T  ❌     │ 4s   1T  ❌     │
│ 6  │ reCAPTCHA         │ Hard   │ 300s 6T  ⚠️     │ 120s 13T ⚠️     │ 4s   1T  ❌     │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│    │ PASS / PARTIAL    │        │ 3✅ 3⚠️ 0❌     │ 4✅ 1⚠️ 1❌     │ 0✅ 2⚠️ 4❌     │
│    │ Total time        │        │ 834s            │ 340s            │ 103s            │
│    │ Total tool calls  │        │ 23              │ 29              │ 21              │
└────┴───────────────────┴────────┴─────────────────┴─────────────────┴─────────────────┘
T = tool calls | ✅ = pass | ⚠️ = partial | ❌ = fail
```

## Speed Comparison

```
Generation speed (tok/s):
Bonsai-8B  ████████████████████████████████████████████████ 48.8
Gemma4 E4B ██████████████████████████████████ 35.0
Qwen3.5-9B ██████████ 10.0

Total suite time:
Bonsai-8B  ██████ 103s (1.7 min)
Gemma4 E4B █████████████████████ 340s (5.7 min)
Qwen3.5-9B █████████████████████████████████████████████████████ 834s (13.9 min)
```

## Model Specs

```
┌──────────────────┬──────────────┬──────────────┬──────────────┐
│                  │ Qwen3.5-9B   │ Gemma4 E4B   │ Bonsai-8B    │
├──────────────────┼──────────────┼──────────────┼──────────────┤
│ Architecture     │ Dense        │ MoE (4B act) │ Dense 1-bit  │
│ Total params     │ 9B           │ 9B           │ 8B           │
│ Active params    │ 9B           │ 4B           │ 8B           │
│ Quantization     │ Q4_K_XL      │ 4-bit MLX    │ Q1_0_g128    │
│ Disk size        │ 5.6 GB       │ ~5 GB        │ 1.15 GB      │
│ Memory (peak)    │ ~6.5 GB      │ ~5.4 GB      │ ~1.5 GB      │
│ Generation tok/s │ ~10          │ ~35          │ ~49          │
│ Base model       │ Qwen3.5      │ Gemma4       │ Qwen3        │
│ Vision           │ ✅ mmproj    │ ✅ native    │ ❌ text only │
│ Tool calling     │ ✅ native    │ ✅ patched   │ ✅ native    │
│ Proxy needed     │ ❌           │ ✅           │ ❌           │
│ Server           │ llama.cpp    │ mlx_vlm      │ PrismML fork │
└──────────────────┴──────────────┴──────────────┴──────────────┘
```

## Task-by-Task Analysis

### T1: Wikipedia Extract (Easy)

```
Qwen:   1 tool (scrape_url), got only part of the info   ⚠️ Partial
Gemma:  4 tools (navigate+extract), all 3 answers        ✅ Pass
Bonsai: 1 tool (navigate), stopped too early             ⚠️ Partial
```

### T2: DuckDuckGo Search (Medium)

```
Qwen:   6 tools, hit 180s timeout                        ⚠️ Partial (slow)
Gemma:  6 tools (go+type+click+scroll+extract)           ✅ Pass
Bonsai: 16 tools but 21 idle turns, confused             ❌ Fail
```

### T3: HN Top Story (Easy)

```
Qwen:   2 tools, clean stop_loop                         ✅ Pass
Gemma:  2 tools, found title                             ✅ Pass
Bonsai: 1 tool, done in 4s but didn't extract the title  ⚠️ Partial
```

### T4: Cat Vision with Falcon (Medium)

```
Qwen:   2 tools incl. vision_detect                      ✅ Pass
Gemma:  3 tools, 2x vision_detect, found 6 cats          ✅ Pass
Bonsai: 1 tool, never called vision_detect               ❌ Fail
```

### T5: Form Filling (Medium)

```
Qwen:   6 tools (go+input+click x3), stop_loop           ✅ Pass (only winner)
Gemma:  1 tool, stuck thinking about form fields         ❌ Fail
Bonsai: 1 tool, only navigated                           ❌ Fail
```

### T6: reCAPTCHA (Hard)

```
Qwen:   6 tools, 3 vision, 1 batch click, timeout        ⚠️ Partial
Gemma:  13 tools, 9 vision, 5 batch clicks, timeout      ⚠️ Partial (most active)
Bonsai: 1 tool, only navigated                           ❌ Fail
```

## Scoring Matrix

```
┌───────────────────┬───────┬───────┬────────┐
│ Task              │ Qwen  │ Gemma │ Bonsai │
├───────────────────┼───────┼───────┼────────┤
│ T1 Wikipedia      │ 0.5   │ 1.0   │ 0.5    │
│ T2 DDG Search     │ 0.5   │ 1.0   │ 0.0    │
│ T3 HN Story       │ 1.0   │ 1.0   │ 0.5    │
│ T4 Cat Vision     │ 1.0   │ 1.0   │ 0.0    │
│ T5 Form Fill      │ 1.0   │ 0.0   │ 0.0    │
│ T6 reCAPTCHA      │ 0.5   │ 0.5   │ 0.0    │
├───────────────────┼───────┼───────┼────────┤
│ TOTAL             │ 4.5   │ 4.5   │ 1.0    │
│ Speed factor      │ 1x    │ 3.5x  │ 4.9x   │
│ Score/minute      │ 0.32  │ 0.79  │ 0.58   │
└───────────────────┴───────┴───────┴────────┘
Score/minute = total_score / suite_time_minutes
Speed factor = generation tok/s relative to Qwen3.5-9B
```

## Memory Budget on 16GB

```
┌──────────────────────┬──────────┬──────────┬──────────┐
│ Component            │ +Qwen    │ +Gemma4  │ +Bonsai  │
├──────────────────────┼──────────┼──────────┼──────────┤
│ LLM model            │ 6.5 GB   │ 5.4 GB   │ 1.5 GB   │
│ Falcon Perception    │ 1.5 GB   │ 1.5 GB   │ 1.5 GB   │
│ GUA_Blazor + Browser │ 0.8 GB   │ 0.8 GB   │ 0.8 GB   │
│ Proxy                │ —        │ 0.1 GB   │ —        │
│ OS + system          │ 3.0 GB   │ 3.0 GB   │ 3.0 GB   │
├──────────────────────┼──────────┼──────────┼──────────┤
│ TOTAL                │ 11.8 GB  │ 10.8 GB  │ 6.8 GB   │
│ HEADROOM             │ 4.2 GB   │ 5.2 GB   │ 9.2 GB ★ │
└──────────────────────┴──────────┴──────────┴──────────┘
Bonsai leaves 9.2 GB free: enough to run ANOTHER model simultaneously!
```

## Verdict

```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│ 🏆 BEST OVERALL: Gemma4 E4B                                    │
│    4.5 score, 3.5x speed, best score/minute (0.79)             │
│    Wins: Wikipedia, DDG search, HN, Cat vision                 │
│                                                                │
│ 🥈 MOST RELIABLE: Qwen3.5-9B                                   │
│    4.5 score, slowest, but the ONLY model that fills forms     │
│    Wins: Form filling (exclusive), HN, Cat vision              │
│                                                                │
│ 🥉 FASTEST BUT WEAKEST: Bonsai-8B                              │
│    1.0 score, 4.9x speed, but can't do multi-step tasks        │
│    Only 1.15 GB on disk, so it could pair with another model   │
│                                                                │
│ IDEAL SETUP:                                                   │
│    Bonsai-8B (1.5 GB) + Qwen3.5-9B (6.5 GB) = 8 GB             │
│    Route simple tasks to Bonsai, complex tasks to Qwen         │
│    Both fit on 16GB simultaneously!                            │
│                                                                │
│ OR: Gemma4 E4B alone (best score/minute, most versatile)       │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

---

*All tests run on a Mac Mini M4 16GB with the identical GUA_Blazor agent loop, Falcon Perception v2 vision backend, and the same prompts.*
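As a sanity check, the aggregate rows above (suite time, tool-call totals, scoring-matrix totals, score/minute, and memory headroom) can be re-derived from the per-task data; a minimal sketch, with the per-task tuples transcribed from the Test Results and Scoring Matrix tables:

```python
# Per-task results for T1..T6: (time in seconds, tool calls, score).
tasks = {
    "Qwen3.5-9B": [(51, 1, 0.5), (180, 6, 0.5), (32, 2, 1.0),
                   (34, 2, 1.0), (237, 6, 1.0), (300, 6, 0.5)],
    "Gemma4 E4B": [(30, 4, 1.0), (60, 6, 1.0), (30, 2, 1.0),
                   (40, 3, 1.0), (60, 1, 0.0), (120, 13, 0.5)],
    "Bonsai-8B":  [(33, 1, 0.5), (53, 16, 0.0), (4, 1, 0.5),
                   (5, 1, 0.0), (4, 1, 0.0), (4, 1, 0.0)],
}

for model, rows in tasks.items():
    suite_s = sum(t for t, _, _ in rows)   # total suite time (s)
    tools = sum(c for _, c, _ in rows)     # total tool calls
    score = sum(s for _, _, s in rows)     # scoring-matrix total
    spm = score / (suite_s / 60)           # score per minute
    print(f"{model}: {suite_s}s, {tools} tools, score {score}, {spm:.2f}/min")
    # → e.g. Qwen3.5-9B: 834s, 23 tools, score 4.5, 0.32/min

# Headroom on a 16 GB machine, from the memory-budget table
# (Gemma4 additionally carries the 0.1 GB proxy).
shared = 1.5 + 0.8 + 3.0                   # Falcon + GUA_Blazor + OS
llm = {"Qwen3.5-9B": 6.5, "Gemma4 E4B": 5.4 + 0.1, "Bonsai-8B": 1.5}
for model, gb in llm.items():
    print(f"{model}: headroom {16 - (shared + gb):.1f} GB")
    # → 4.2 / 5.2 / 9.2 GB, matching the budget table
```

This reproduces every total in the report, including the 0.32 / 0.79 / 0.58 score-per-minute figures.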