Three-Model Comparison: Qwen3.5-9B vs Gemma4 E4B vs Bonsai-8B
Date: 2026-04-07 | Hardware: Mac Mini M4 16GB | Vision: Falcon Perception v2
Test Results
┌────┬───────────────────┬────────┬─────────────────┬─────────────────┬─────────────────┐
│    │                   │        │ Qwen3.5-9B      │ Gemma4 E4B      │ Bonsai-8B       │
│ #  │ Task              │ Diff.  │ llama.cpp       │ mlx_vlm+proxy   │ prism llama.cpp │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│ 1  │ Wikipedia extract │ Easy   │ 51s 1T ⚠️       │ 30s 4T ✅       │ 33s 1T ⚠️       │
│ 2  │ DDG search        │ Medium │ 180s 6T ⚠️      │ 60s 6T ✅       │ 53s 16T ❌      │
│ 3  │ HN top story      │ Easy   │ 32s 2T ✅       │ 30s 2T ✅       │ 4s 1T ⚠️        │
│ 4  │ Cat vision (FP)   │ Medium │ 34s 2T ✅       │ 40s 3T ✅       │ 5s 1T ❌        │
│ 5  │ Form filling      │ Medium │ 237s 6T ✅      │ 60s 1T ❌       │ 4s 1T ❌        │
│ 6  │ reCAPTCHA         │ Hard   │ 300s 6T ⚠️      │ 120s 13T ⚠️     │ 4s 1T ❌        │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│    │ PASS/PARTIAL/FAIL │        │ 3✅ 2⚠️ 1❌     │ 4✅ 1⚠️ 1❌     │ 0✅ 2⚠️ 4❌     │
│    │ Total time        │        │ 834s            │ 340s            │ 103s            │
│    │ Total tool calls  │        │ 23              │ 29              │ 21              │
└────┴───────────────────┴────────┴─────────────────┴─────────────────┴─────────────────┘
T = tool calls | ✅ = pass | ⚠️ = partial | ❌ = fail
Speed Comparison
Generation speed (tok/s):
  Bonsai-8B   █████████████████████████████████████████████████ 48.8
  Gemma4 E4B  ███████████████████████████████████ 35.0
  Qwen3.5-9B  ██████████ 10.0
Total suite time:
  Bonsai-8B   ██████ 103s (1.7 min)
  Gemma4 E4B  █████████████████████ 340s (5.7 min)
  Qwen3.5-9B  █████████████████████████████████████████████████████ 834s (13.9 min)
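The two charts measure different things: raw generation throughput and wall-clock time for the whole suite. As a rough sketch, the ratios relative to Qwen3.5-9B can be computed from the figures above; the `speedups` helper and the choice of Qwen as the baseline are assumptions made for illustration.

```python
# Figures copied from the two charts above.
TOK_PER_S = {"Qwen3.5-9B": 10.0, "Gemma4 E4B": 35.0, "Bonsai-8B": 48.8}
SUITE_S = {"Qwen3.5-9B": 834, "Gemma4 E4B": 340, "Bonsai-8B": 103}
BASELINE = "Qwen3.5-9B"  # assumed baseline: the slowest model


def speedups(metric, higher_is_better):
    """Ratio of each model to the baseline, oriented so that larger means faster."""
    base = metric[BASELINE]
    return {
        model: round(value / base if higher_is_better else base / value, 2)
        for model, value in metric.items()
    }


generation = speedups(TOK_PER_S, higher_is_better=True)
wall_clock = speedups(SUITE_S, higher_is_better=False)
print("generation:", generation)
print("wall clock:", wall_clock)
```

The wall-clock ratio depends on how many turns each model took as well as on token throughput, so a model that stops after a single tool call finishes quickly without necessarily completing the task.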
Model Specs
┌────────────────────┬──────────────┬──────────────┬──────────────┐
│                    │ Qwen3.5-9B   │ Gemma4 E4B   │ Bonsai-8B    │
├────────────────────┼──────────────┼──────────────┼──────────────┤
│ Architecture       │ Dense        │ MoE (4B act) │ Dense 1-bit  │
│ Total params       │ 9B           │ 9B           │ 8B           │
│ Active params      │ 9B           │ 4B           │ 8B           │
│ Quantization       │ Q4_K_XL      │ 4-bit MLX    │ Q1_0_g128    │
│ Disk size          │ 5.6 GB       │ ~5 GB        │ 1.15 GB      │
│ Memory (peak)      │ ~6.5 GB      │ ~5.4 GB      │ ~1.5 GB      │
│ Generation tok/s   │ ~10          │ ~35          │ ~49          │
│ Base model         │ Qwen3.5      │ Gemma4       │ Qwen3        │
│ Vision             │ ✅ mmproj    │ ✅ native    │ ❌ text only │
│ Tool calling       │ ✅ native    │ ✅ patched   │ ✅ native    │
│ Proxy needed       │ ❌           │ ✅           │ ❌           │
│ Server             │ llama.cpp    │ mlx_vlm      │ PrismML fork │
└────────────────────┴──────────────┴──────────────┴──────────────┘
Task-by-Task Analysis
T1: Wikipedia Extract (Easy)
  Qwen:   1 tool (scrape_url), got some info                  ⚠️ Partial
  Gemma:  4 tools (navigate+extract), all 3 answers           ✅ Pass
  Bonsai: 1 tool (navigate), stopped too early                ⚠️ Partial
T2: DuckDuckGo Search (Medium)
  Qwen:   6 tools, hit 180s timeout                           ⚠️ Partial (slow)
  Gemma:  6 tools (go+type+click+scroll+extract)              ✅ Pass
  Bonsai: 16 tools but 21 idle turns, confused                ❌ Fail
T3: HN Top Story (Easy)
  Qwen:   2 tools, clean stop_loop                            ✅ Pass
  Gemma:  2 tools, found title                                ✅ Pass
  Bonsai: 1 tool, 4s but didn't extract properly              ⚠️ Partial
T4: Cat Vision with Falcon (Medium)
  Qwen:   2 tools incl. vision_detect                         ✅ Pass
  Gemma:  3 tools, 2x vision_detect, found 6 cats             ✅ Pass
  Bonsai: 1 tool, never called vision_detect                  ❌ Fail
T5: Form Filling (Medium)
  Qwen:   6 tools (go+input+click x3), stop_loop              ✅ Pass (only model to pass)
  Gemma:  1 tool, stuck thinking about form fields            ❌ Fail
  Bonsai: 1 tool, only navigated                              ❌ Fail
T6: reCAPTCHA (Hard)
  Qwen:   6 tools, 3 vision, 1 batch click, timeout           ⚠️ Partial
  Gemma:  13 tools, 9 vision, 5 batch clicks, timeout         ⚠️ Partial (most active)
  Bonsai: 1 tool, only navigated                              ❌ Fail
Scoring Matrix
┌───────────────────┬────────┬────────┬────────┐
│ Task              │ Qwen   │ Gemma  │ Bonsai │
├───────────────────┼────────┼────────┼────────┤
│ T1 Wikipedia      │ 0.5    │ 1.0    │ 0.5    │
│ T2 DDG Search     │ 0.5    │ 1.0    │ 0.0    │
│ T3 HN Story       │ 1.0    │ 1.0    │ 0.5    │
│ T4 Cat Vision     │ 1.0    │ 1.0    │ 0.0    │
│ T5 Form Fill      │ 1.0    │ 0.0    │ 0.0    │
│ T6 reCAPTCHA      │ 0.5    │ 0.5    │ 0.0    │
├───────────────────┼────────┼────────┼────────┤
│ TOTAL             │ 4.5    │ 4.5    │ 1.0    │
│ Speed factor      │ 1x     │ 3.5x   │ 4.9x   │
│ Score/minute      │ 0.32   │ 0.79   │ 0.58   │
└───────────────────┴────────┴────────┴────────┘
Scoring: pass = 1.0 | partial = 0.5 | fail = 0.0
Speed factor = generation tok/s relative to Qwen3.5-9B
Score/minute = total_score / suite_time_minutes
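The score/minute row follows directly from the formula above. A minimal sketch of the arithmetic, assuming the 1.0 / 0.5 / 0.0 weighting for pass / partial / fail that the matrix implies (the dictionary and function names are illustrative):

```python
# Assumed weighting, inferred from the values in the scoring matrix.
STATUS_SCORE = {"pass": 1.0, "partial": 0.5, "fail": 0.0}

# Per-task outcomes for T1..T6, transcribed from the report.
STATUSES = {
    "Qwen3.5-9B": ["partial", "partial", "pass", "pass", "pass", "partial"],
    "Gemma4 E4B": ["pass", "pass", "pass", "pass", "fail", "partial"],
    "Bonsai-8B": ["partial", "fail", "partial", "fail", "fail", "fail"],
}
SUITE_S = {"Qwen3.5-9B": 834, "Gemma4 E4B": 340, "Bonsai-8B": 103}


def score_per_minute(model):
    """Return (total score, score per minute of suite time) for one model."""
    total = sum(STATUS_SCORE[status] for status in STATUSES[model])
    minutes = SUITE_S[model] / 60
    return total, round(total / minutes, 2)


for model in STATUSES:
    print(model, score_per_minute(model))
```

Because the metric divides by elapsed time, a fast model can post a respectable score/minute with a low absolute score, so the metric is best read alongside the TOTAL row.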
Memory Budget on 16GB
┌──────────────────────┬──────────┬──────────┬──────────┐
│ Component            │ +Qwen    │ +Gemma4  │ +Bonsai  │
├──────────────────────┼──────────┼──────────┼──────────┤
│ LLM model            │ 6.5 GB   │ 5.4 GB   │ 1.5 GB   │
│ Falcon Perception    │ 1.5 GB   │ 1.5 GB   │ 1.5 GB   │
│ GUA_Blazor + Browser │ 0.8 GB   │ 0.8 GB   │ 0.8 GB   │
│ Proxy                │ -        │ 0.1 GB   │ -        │
│ OS + system          │ 3.0 GB   │ 3.0 GB   │ 3.0 GB   │
├──────────────────────┼──────────┼──────────┼──────────┤
│ TOTAL                │ 11.8 GB  │ 10.8 GB  │ 6.8 GB   │
│ HEADROOM             │ 4.2 GB   │ 5.2 GB   │ 9.2 GB ✅│
└──────────────────────┴──────────┴──────────┴──────────┘
Bonsai leaves 9.2 GB free, enough to run a second model alongside it.
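The budget is simple addition, which makes it easy to test other combinations. The sketch below recomputes headroom from the component figures in the table and also checks the case where two models are loaded at once; the `headroom` helper is a hypothetical name, and the figures are the report's approximate peak values rather than fresh measurements.

```python
TOTAL_RAM_GB = 16.0

# Components shared by every configuration (from the table above).
SHARED_GB = {"Falcon Perception": 1.5, "GUA_Blazor + Browser": 0.8, "OS + system": 3.0}

# Approximate peak memory per model, plus the proxy only Gemma4 needs.
MODEL_GB = {"Qwen3.5-9B": 6.5, "Gemma4 E4B": 5.4, "Bonsai-8B": 1.5}
PROXY_GB = {"Gemma4 E4B": 0.1}


def headroom(*models):
    """Free memory in GB when the given models are loaded together."""
    used = sum(SHARED_GB.values())
    used += sum(MODEL_GB[m] + PROXY_GB.get(m, 0.0) for m in models)
    return round(TOTAL_RAM_GB - used, 1)


print(headroom("Qwen3.5-9B"))               # single-model headroom
print(headroom("Bonsai-8B", "Qwen3.5-9B"))  # both models resident at once
```

Under these figures, loading Bonsai-8B and Qwen3.5-9B together still leaves a few GB free, although real headroom will shrink as context length and KV cache grow.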
Verdict
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  🏆 BEST OVERALL: Gemma4 E4B                                   │
│     4.5 score, 3.5x speed, best score/minute (0.79)            │
│     Wins: Wikipedia, DDG search, HN, Cat vision                │
│                                                                │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                  │
│     4.5 score, slowest but ONLY model that fills forms         │
│     Wins: Form filling (exclusive), HN, Cat vision             │
│                                                                │
│  🥉 FASTEST BUT WEAKEST: Bonsai-8B                             │
│     1.0 score, 4.9x speed, but can't do multi-step tasks       │
│     Only 1.15 GB, could pair with another model                │
│                                                                │
│  IDEAL SETUP:                                                  │
│     Bonsai-8B (1.5 GB) + Qwen3.5-9B (6.5 GB) = 8 GB            │
│     Route simple → Bonsai, complex → Qwen                      │
│     Both fit on 16GB simultaneously!                           │
│                                                                │
│  OR: Gemma4 E4B alone (best score/minute, most versatile)      │
│                                                                │
└────────────────────────────────────────────────────────────────┘
All tests were run on a Mac Mini M4 16GB with an identical GUA_Blazor agent loop, the Falcon Perception v2 vision backend, and the same prompts.
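The "route simple → Bonsai, complex → Qwen" setup from the verdict could be sketched as a small dispatch function. Everything here is hypothetical: the `Task` fields, the `pick_model` name, and the one-step threshold are illustrative assumptions, not part of GUA_Blazor or the tested configuration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    expected_steps: int        # rough estimate of tool calls needed (assumed input)
    needs_vision: bool = False


def pick_model(task, max_simple_steps=1):
    """Send single-step, text-only tasks to Bonsai-8B and everything else to Qwen3.5-9B.

    Bonsai-8B is text only in this report, so any vision task goes to Qwen.
    The step threshold is an assumption and would need tuning.
    """
    if task.needs_vision or task.expected_steps > max_simple_steps:
        return "Qwen3.5-9B"
    return "Bonsai-8B"


print(pick_model(Task(expected_steps=1)))                     # Bonsai-8B
print(pick_model(Task(expected_steps=6)))                     # Qwen3.5-9B
print(pick_model(Task(expected_steps=1, needs_vision=True)))  # Qwen3.5-9B
```

Since Bonsai-8B did not fully pass any task in this suite, what counts as "simple" would need to be validated on real workloads before relying on such a router.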