# Three-Model Comparison: Qwen3.5-9B vs Gemma4 E4B vs Bonsai-8B
**Date:** 2026-04-07 | **Hardware:** Mac Mini M4 16GB | **Vision:** Falcon Perception v2
---
## Test Results
```
┌────┬───────────────────┬────────┬─────────────────┬─────────────────┬─────────────────┐
│    │                   │        │ Qwen3.5-9B      │ Gemma4 E4B      │ Bonsai-8B       │
│ #  │ Task              │ Diff.  │ llama.cpp       │ mlx_vlm+proxy   │ prism llama.cpp │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│ 1  │ Wikipedia extract │ Easy   │  51s  1T ⚠️     │  30s  4T ✅     │  33s  1T ⚠️     │
│ 2  │ DDG search        │ Medium │ 180s  6T ⚠️     │  60s  6T ✅     │  53s 16T ❌     │
│ 3  │ HN top story      │ Easy   │  32s  2T ✅     │  30s  2T ✅     │   4s  1T ⚠️     │
│ 4  │ Cat vision (FP)   │ Medium │  34s  2T ✅     │  40s  3T ✅     │   5s  1T ❌     │
│ 5  │ Form filling      │ Medium │ 237s  6T ✅     │  60s  1T ❌     │   4s  1T ❌     │
│ 6  │ reCAPTCHA         │ Hard   │ 300s  6T ⚠️     │ 120s 13T ⚠️     │   4s  1T ❌     │
├────┼───────────────────┼────────┼─────────────────┼─────────────────┼─────────────────┤
│    │ PASS/PARTIAL/FAIL │        │ 3✅ 3⚠️ 0❌     │ 4✅ 1⚠️ 1❌     │ 0✅ 2⚠️ 4❌     │
│    │ Total time        │        │ 834s            │ 340s            │ 103s            │
│    │ Total tool calls  │        │ 23              │ 29              │ 21              │
└────┴───────────────────┴────────┴─────────────────┴─────────────────┴─────────────────┘
T = tool calls | ✅ = pass | ⚠️ = partial | ❌ = fail
```
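The summary rows can be recomputed from the per-task rows. A minimal sketch, assuming the per-task times, tool-call counts, and outcomes are transcribed as below; the `RESULTS` layout and the `summarize` helper are illustrative names, not part of the test harness:

```python
from collections import Counter

# Per-task rows transcribed from the results table: (seconds, tool_calls, outcome).
RESULTS = {
    "Qwen3.5-9B": [(51, 1, "partial"), (180, 6, "partial"), (32, 2, "pass"),
                   (34, 2, "pass"), (237, 6, "pass"), (300, 6, "partial")],
    "Gemma4 E4B": [(30, 4, "pass"), (60, 6, "pass"), (30, 2, "pass"),
                   (40, 3, "pass"), (60, 1, "fail"), (120, 13, "partial")],
    "Bonsai-8B": [(33, 1, "partial"), (53, 16, "fail"), (4, 1, "partial"),
                  (5, 1, "fail"), (4, 1, "fail"), (4, 1, "fail")],
}


def summarize(rows):
    """Return total seconds, total tool calls, and outcome counts for one model."""
    outcomes = Counter(outcome for _, _, outcome in rows)
    return {
        "seconds": sum(seconds for seconds, _, _ in rows),
        "tool_calls": sum(calls for _, calls, _ in rows),
        "pass": outcomes["pass"],
        "partial": outcomes["partial"],
        "fail": outcomes["fail"],
    }


SUMMARY = {model: summarize(rows) for model, rows in RESULTS.items()}

for model, s in SUMMARY.items():
    print(f"{model}: {s['seconds']}s, {s['tool_calls']}T, "
          f"{s['pass']} pass / {s['partial']} partial / {s['fail']} fail")
```

Keeping the raw rows as the single source and deriving the totals makes it harder for the summary lines to drift from the per-task data.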
## Speed Comparison
```
Generation speed (tok/s):
  Bonsai-8B   █████████████████████████████████████████████████ 48.8
  Gemma4 E4B  ███████████████████████████████████ 35.0
  Qwen3.5-9B  ██████████ 10.0

Total suite time:
  Bonsai-8B   ██████ 103s (1.7 min)
  Gemma4 E4B  █████████████████████ 340s (5.7 min)
  Qwen3.5-9B  ████████████████████████████████████████████████████ 834s (13.9 min)
```
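To put the tok/s figures in perspective, the time to generate a reply of a given length is simply tokens divided by throughput. A small sketch; the 500-token reply length is a hypothetical example, and prompt processing and tool latency are ignored:

```python
# Generation throughput in tokens per second, as measured above.
TOK_PER_S = {"Bonsai-8B": 48.8, "Gemma4 E4B": 35.0, "Qwen3.5-9B": 10.0}


def generation_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Estimate pure generation time; excludes prompt processing and tool calls."""
    return n_tokens / tok_per_s


REPLY_TOKENS = 500  # hypothetical reply length, for illustration only

for model, speed in TOK_PER_S.items():
    print(f"{model}: {generation_seconds(REPLY_TOKENS, speed):.1f}s for {REPLY_TOKENS} tokens")
```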
## Model Specs
```
┌──────────────────┬──────────────┬──────────────┬──────────────┐
│                  │ Qwen3.5-9B   │ Gemma4 E4B   │ Bonsai-8B    │
├──────────────────┼──────────────┼──────────────┼──────────────┤
│ Architecture     │ Dense        │ MoE (4B act) │ Dense 1-bit  │
│ Total params     │ 9B           │ 9B           │ 8B           │
│ Active params    │ 9B           │ 4B           │ 8B           │
│ Quantization     │ Q4_K_XL      │ 4-bit MLX    │ Q1_0_g128    │
│ Disk size        │ 5.6 GB       │ ~5 GB        │ 1.15 GB      │
│ Memory (peak)    │ ~6.5 GB      │ ~5.4 GB      │ ~1.5 GB      │
│ Generation tok/s │ ~10          │ ~35          │ ~49          │
│ Base model       │ Qwen3.5      │ Gemma4       │ Qwen3        │
│ Vision           │ ✅ mmproj    │ ✅ native    │ ❌ text only │
│ Tool calling     │ ✅ native    │ ✅ patched   │ ✅ native    │
│ Proxy needed     │ ❌           │ ✅           │ ❌           │
│ Server           │ llama.cpp    │ mlx_vlm      │ PrismML fork │
└──────────────────┴──────────────┴──────────────┴──────────────┘
```
## Task-by-Task Analysis
### T1: Wikipedia Extract (Easy)
```
Qwen:   1 tool (scrape_url), got some info          ⚠️ Partial
Gemma:  4 tools (navigate+extract), all 3 answers   ✅ Pass
Bonsai: 1 tool (navigate), stopped too early        ⚠️ Partial
```
### T2: DuckDuckGo Search (Medium)
```
Qwen:   6 tools, hit 180s timeout                   ⚠️ Partial (slow)
Gemma:  6 tools (go+type+click+scroll+extract)      ✅ Pass
Bonsai: 16 tools but 21 idle turns, confused        ❌ Fail
```
### T3: HN Top Story (Easy)
```
Qwen:   2 tools, clean stop_loop                    ✅ Pass
Gemma:  2 tools, found title                        ✅ Pass
Bonsai: 1 tool, 4s but didn't extract properly      ⚠️ Partial
```
### T4: Cat Vision with Falcon (Medium)
```
Qwen:   2 tools incl. vision_detect                 ✅ Pass
Gemma:  3 tools, 2x vision_detect, found 6 cats     ✅ Pass
Bonsai: 1 tool, never called vision_detect          ❌ Fail
```
### T5: Form Filling (Medium)
```
Qwen:   6 tools (go+input+click x3), stop_loop      ✅ Pass ← ONLY WINNER
Gemma:  1 tool, stuck thinking about form fields    ❌ Fail
Bonsai: 1 tool, only navigated                      ❌ Fail
```
### T6: reCAPTCHA (Hard)
```
Qwen:   6 tools, 3 vision, 1 batch click, timeout    ⚠️ Partial
Gemma:  13 tools, 9 vision, 5 batch clicks, timeout  ⚠️ Partial (most active)
Bonsai: 1 tool, only navigated                       ❌ Fail
```
## Scoring Matrix
```
┌───────────────────┬────────┬────────┬────────┐
│ Task              │ Qwen   │ Gemma  │ Bonsai │
├───────────────────┼────────┼────────┼────────┤
│ T1 Wikipedia      │ 0.5    │ 1.0    │ 0.5    │
│ T2 DDG Search     │ 0.5    │ 1.0    │ 0.0    │
│ T3 HN Story       │ 1.0    │ 1.0    │ 0.5    │
│ T4 Cat Vision     │ 1.0    │ 1.0    │ 0.0    │
│ T5 Form Fill      │ 1.0    │ 0.0    │ 0.0    │
│ T6 reCAPTCHA      │ 0.5    │ 0.5    │ 0.0    │
├───────────────────┼────────┼────────┼────────┤
│ TOTAL             │ 4.5    │ 4.5    │ 1.0    │
│ Speed factor      │ 1x     │ 3.5x   │ 4.9x   │
│ Score/minute      │ 0.32   │ 0.79   │ 0.58   │
└───────────────────┴────────┴────────┴────────┘
Scoring: pass = 1.0 | partial = 0.5 | fail = 0.0
Speed factor = generation tok/s relative to Qwen3.5-9B
Score/minute = total_score / suite_time_minutes
```
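The derived rows of the matrix can be reproduced in a few lines. A sketch using the totals and suite times reported above; reading the speed factor as generation tok/s relative to Qwen3.5-9B is an inference from the numbers (35/10 and 48.8/10), and the function names are illustrative:

```python
# Totals from the scoring matrix and suite times from the results table.
TOTAL_SCORE = {"Qwen3.5-9B": 4.5, "Gemma4 E4B": 4.5, "Bonsai-8B": 1.0}
SUITE_SECONDS = {"Qwen3.5-9B": 834, "Gemma4 E4B": 340, "Bonsai-8B": 103}
TOK_PER_S = {"Qwen3.5-9B": 10.0, "Gemma4 E4B": 35.0, "Bonsai-8B": 48.8}


def score_per_minute(total_score: float, suite_seconds: float) -> float:
    """Score/minute = total_score / suite_time_minutes."""
    return total_score / (suite_seconds / 60.0)


def speed_factor(tok_per_s: float, baseline_tok_per_s: float) -> float:
    """Generation speed relative to a baseline model (assumed: Qwen3.5-9B)."""
    return tok_per_s / baseline_tok_per_s


for model in TOTAL_SCORE:
    spm = score_per_minute(TOTAL_SCORE[model], SUITE_SECONDS[model])
    factor = speed_factor(TOK_PER_S[model], TOK_PER_S["Qwen3.5-9B"])
    print(f"{model}: {spm:.2f} score/min, {factor:.1f}x generation speed")
```

Note that score/minute uses whole-suite wall time, which includes timeouts and tool latency, so it is not the same thing as the tok/s-based speed factor.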
## Memory Budget on 16GB
```
┌──────────────────────┬──────────┬──────────┬──────────┐
│ Component            │ +Qwen    │ +Gemma4  │ +Bonsai  │
├──────────────────────┼──────────┼──────────┼──────────┤
│ LLM model            │ 6.5 GB   │ 5.4 GB   │ 1.5 GB   │
│ Falcon Perception    │ 1.5 GB   │ 1.5 GB   │ 1.5 GB   │
│ GUA_Blazor + Browser │ 0.8 GB   │ 0.8 GB   │ 0.8 GB   │
│ Proxy                │ -        │ 0.1 GB   │ -        │
│ OS + system          │ 3.0 GB   │ 3.0 GB   │ 3.0 GB   │
├──────────────────────┼──────────┼──────────┼──────────┤
│ TOTAL                │ 11.8 GB  │ 10.8 GB  │ 6.8 GB   │
│ HEADROOM             │ 4.2 GB   │ 5.2 GB   │ 9.2 GB ✅│
└──────────────────────┴──────────┴──────────┴──────────┘
Bonsai leaves 9.2 GB free, enough to run ANOTHER model simultaneously!
```
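The budget is plain addition, which also makes it easy to check a two-model pairing. A rough sketch, assuming the peak figures above simply add up and that the shared components are loaded once; `budget` is an illustrative helper, not part of the test setup:

```python
RAM_GB = 16.0

# Components that are loaded once regardless of which LLM is running.
SHARED_GB = {"Falcon Perception": 1.5, "GUA_Blazor + Browser": 0.8, "OS + system": 3.0}

# Peak memory per model, plus the proxy that only the Gemma4 setup needs.
MODEL_GB = {"Qwen3.5-9B": 6.5, "Gemma4 E4B": 5.4, "Bonsai-8B": 1.5}
PROXY_GB = {"Gemma4 E4B": 0.1}


def budget(models):
    """Return (used_gb, headroom_gb) for the given models loaded together."""
    used = sum(SHARED_GB.values())
    used += sum(MODEL_GB[m] + PROXY_GB.get(m, 0.0) for m in models)
    return round(used, 1), round(RAM_GB - used, 1)


for combo in (["Qwen3.5-9B"], ["Gemma4 E4B"], ["Bonsai-8B"], ["Bonsai-8B", "Qwen3.5-9B"]):
    used, headroom = budget(combo)
    print(f"{' + '.join(combo)}: {used} GB used, {headroom} GB headroom")
```

Under these assumptions, the Bonsai + Qwen pairing comes to 13.3 GB used with 2.7 GB of headroom, so it fits, but with less slack than any single-model setup.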
## Verdict
```
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  🏆 BEST OVERALL: Gemma4 E4B                                   │
│     4.5 score, 3.5x speed, best score/minute (0.79)            │
│     Wins: Wikipedia, DDG search, HN, Cat vision                │
│                                                                │
│  🥈 MOST RELIABLE: Qwen3.5-9B                                  │
│     4.5 score, slowest but ONLY model that fills forms         │
│     Wins: Form filling (exclusive), HN, Cat vision             │
│                                                                │
│  🥉 FASTEST BUT WEAKEST: Bonsai-8B                             │
│     1.0 score, 4.9x speed, but can't do multi-step tasks       │
│     Only 1.15 GB, could pair with another model                │
│                                                                │
│  IDEAL SETUP:                                                  │
│     Bonsai-8B (1.5 GB) + Qwen3.5-9B (6.5 GB) = 8 GB            │
│     Route simple→Bonsai, complex→Qwen                          │
│     Both fit on 16GB simultaneously!                           │
│                                                                │
│  OR: Gemma4 E4B alone (best score/minute, most versatile)      │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
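The "route simple→Bonsai, complex→Qwen" idea could look roughly like this. Everything here is hypothetical: the report does not say how routing would be decided, so the `Task` fields, the `pick_model` name, and the one-step threshold are assumptions chosen to match the observed weaknesses (Bonsai is text-only and stalled on multi-step tasks):

```python
from dataclasses import dataclass

FAST_MODEL = "Bonsai-8B"     # fastest, text only, weak on multi-step tasks
STRONG_MODEL = "Qwen3.5-9B"  # slowest, but handled vision and form filling


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor used only for this routing sketch."""
    description: str
    needs_vision: bool = False
    expected_steps: int = 1


def pick_model(task: Task, max_fast_steps: int = 1) -> str:
    """Send vision or multi-step work to the strong model, the rest to the fast one."""
    if task.needs_vision or task.expected_steps > max_fast_steps:
        return STRONG_MODEL
    return FAST_MODEL


print(pick_model(Task("fetch a page title")))
print(pick_model(Task("fill in a contact form", expected_steps=6)))
print(pick_model(Task("count cats in an image", needs_vision=True)))
```

Since Bonsai only reached partial results even on the easy tasks here, a real router would likely need a fallback to the stronger model when the fast one stops early.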
---
*All tests were run on a Mac Mini M4 16GB with the identical GUA_Blazor agent loop,*
*the Falcon Perception v2 vision backend, and the same prompts.*