# Small LLM Agent Benchmark

**Real-world browser agent benchmark for small language models on Apple Silicon (16GB)**

> BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

## TL;DR Results

| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| 🏆 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 🏆 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | n/a | 14+ GB | Doesn't fit 16GB |
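
The "Efficiency king" note can be checked by dividing score by memory footprint. The sketch below does that for four rows copied from the table. The metric itself (`score_per_gb`) is an illustrative assumption, not the rule used to rank the table.

```python
# Rows copied from the results table: (model, score out of 6, tok/s, memory in GB).
ROWS = [
    ("Gemma4 E4B Uncensored Q5_K_P", 5.0, 24.5, 6.3),
    ("Qwen3.5-9B Uncensored Q6_K", 5.0, 13.5, 7.8),
    ("LFM2-1.2B-Tool Q8_0 (slim)", 4.5, 76.0, 2.75),
    ("Bonsai-8B 1-bit", 1.0, 48.8, 1.5),
]


def score_per_gb(row):
    """Hypothetical efficiency metric: task score divided by memory footprint."""
    _, score, _, memory_gb = row
    return score / memory_gb


def rank_by_efficiency(rows):
    """Return rows sorted from most to least score per GB."""
    return sorted(rows, key=score_per_gb, reverse=True)


if __name__ == "__main__":
    for name, score, speed, mem in rank_by_efficiency(ROWS):
        print(f"{name:32s} {score / mem:.2f} points/GB at {speed} tok/s")
```

On these rows LFM2-1.2B-Tool comes out first by a wide margin, while the two 5.0/6 models trade accuracy for a much larger footprint.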

## What This Benchmarks

6 real-world browser agent tasks, not synthetic function-call formatting tests:

| # | Task | Difficulty | What it tests |
|---|------|-----------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |

Each test requires **multi-step tool chaining**, not single-turn function-call formatting.

## 10 Counter-Intuitive Findings

1. **BFCL ≠ Agent Capability**: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
2. **Higher Quant ≠ Better for MoE**: Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
3. **Higher Quant = Better for Dense**: Qwen: Q4 (3.5) < Q6 (5.0)
4. **Uncensored ≠ Better Agent**: Quality gains come from quantization, not censoring
5. **Faster Backend ≠ Better Results**: GGUF 24 tok/s beats MLX 35 tok/s (proxy issues)
6. **197 tok/s Model Scores 0/6**: FunctionGemma is useless despite being fastest
7. **4B MoE = 9B Dense**: Gemma4 E4B matches Qwen3.5-9B on agent tasks
8. **1.2B Specialized > 8B Base**: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
9. **The "Capability Cliff" Has Exceptions**: LFM2-1.2B-Tool breaks the 4B param rule
10. **Small Models Are Context-Starved**: Reducing tools 26→8 pushed LFM2 from 4.0→4.5
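
Finding 10 comes down to prompt budget: every tool schema sent to the model consumes context before the task even starts. Here is a minimal sketch of trimming a tool list to an allowlist and comparing the serialized size. Character count is only a rough stand-in for tokens, and every tool name except `browser_use` and `vision_detect` is a placeholder, since the actual 26-tool and 8-tool sets are not listed in this README.

```python
import json


def make_tool(name, description):
    """Build an OpenAI-style function tool schema with a single string argument."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"action": {"type": "string"}},
                "required": ["action"],
            },
        },
    }


def trim_tools(tools, allowed_names):
    """Keep only the tools whose names are in allowed_names, preserving order."""
    return [t for t in tools if t["function"]["name"] in allowed_names]


def schema_chars(tools):
    """Serialized size in characters: a rough stand-in for prompt tokens."""
    return len(json.dumps(tools))


if __name__ == "__main__":
    # 26 placeholder tools; only the first two names appear in this README.
    names = ["browser_use", "vision_detect"] + [f"extra_tool_{i}" for i in range(24)]
    full = [make_tool(n, f"Placeholder description for {n}") for n in names]
    slim = trim_tools(full, set(names[:8]))
    print(len(full), "tools:", schema_chars(full), "chars")
    print(len(slim), "tools:", schema_chars(slim), "chars")
```

With real schemas the savings depend on how verbose each description and parameter block is, so measure with the model's own tokenizer before drawing conclusions.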

## 5-Axis Analysis

### Axis 1: Model Family
- **Minimum ~4B active params** for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

### Axis 2: Censoring
- Uncensored models show **no advantage** for tool-calling agent tasks
- Quality improvements are entirely from quantization level, not censoring

### Axis 3: Quantization
- **MoE models**: Q5 is the sweet spot (speed > precision)
- **Dense models**: Q6 is the sweet spot (precision > speed)
- Never go below Q4 or above Q8 for agent tasks
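
The sweet spots above can be read directly off the scores reported in this README. The following sketch picks the best-scoring quantization per family from the uncensored runs only; the MoE/Dense labels follow this README's own classification, and the helper name `best_quant_per_family` is hypothetical.

```python
# (family, architecture label used in this README, quant, score out of 6)
RUNS = [
    ("Gemma4 E4B", "MoE", "Q5_K_P", 5.0),
    ("Gemma4 E4B", "MoE", "Q6_K_P", 4.5),
    ("Gemma4 E4B", "MoE", "Q8_K_P", 4.0),
    ("Qwen3.5-9B", "Dense", "Q4_K_M", 3.5),
    ("Qwen3.5-9B", "Dense", "Q6_K", 5.0),
]


def best_quant_per_family(runs):
    """Map each model family to the (quant, score) pair with the highest score."""
    best = {}
    for family, _arch, quant, score in runs:
        if family not in best or score > best[family][1]:
            best[family] = (quant, score)
    return best


if __name__ == "__main__":
    for family, (quant, score) in best_quant_per_family(RUNS).items():
        print(f"{family}: best at {quant} ({score}/6)")
```

Each family has only two or three data points here, so treat the result as a starting point for your own runs, not a general rule.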

### Axis 4: Backend
- **llama.cpp GGUF** is the universal winner: native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4

### Axis 5: Vision
- **mmproj + Falcon Perception** together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

## Hardware

- Mac Mini M4 16GB (Apple Silicon)
- macOS Darwin 24.3.0
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework

## Architecture

```
User Task → GUA_Blazor (agent loop, 25 turns)
  → LLM (llama.cpp, port 8081): reasoning + tool calling
  → Falcon Perception (MLX, port 8090): vision detection
  → Playwright Chromium: browser automation
```
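
The agent loop in the diagram can be sketched in a few lines. This is not GUA_Blazor's code (that framework is .NET); it is a minimal Python illustration of a turn-capped loop over OpenAI-style chat messages. The names `run_agent`, `call_llm`, and `run_tool` are hypothetical stand-ins for the real LLM and browser/vision backends.

```python
import json

MAX_TURNS = 25  # turn cap taken from the diagram above


def run_agent(task, call_llm, run_tool, max_turns=MAX_TURNS):
    """Alternate between the model and its tools until it answers or the cap hits.

    call_llm(messages) -> assistant message dict (OpenAI chat format).
    run_tool(name, arguments) -> string result fed back to the model.
    Both callables are hypothetical stand-ins for the real backends.
    """
    messages = [{"role": "user", "content": task}]
    for turn in range(1, max_turns + 1):
        reply = call_llm(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool requested: treat the text as the final answer.
            return {"answer": reply.get("content"), "turns": turn}
        for call in tool_calls:
            fn = call["function"]
            result = run_tool(fn["name"], json.loads(fn["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    # Cap reached without a final answer (e.g. a model stuck in a loop).
    return {"answer": None, "turns": max_turns}


if __name__ == "__main__":
    # Scripted fake model: one navigate call, then a final answer.
    script = iter([
        {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1", "type": "function",
            "function": {
                "name": "browser_use",
                "arguments": '{"action": "navigate", "url": "https://news.ycombinator.com"}',
            },
        }]},
        {"role": "assistant", "content": "Top story: (title read from page)"},
    ])
    outcome = run_agent(
        "Report the top Hacker News story",
        call_llm=lambda messages: next(script),
        run_tool=lambda name, args: f"{name} ok: {args['action']}",
    )
    print(outcome)
```

The turn cap is what turns an endlessly looping model into a clean failure instead of a hung run.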

## Run It Yourself

### Prerequisites
```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception  # or clone github.com/tiiuae/falcon-perception
```

### Quick Speed Test
```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384

# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "test",
  "stream": false,
  "messages": [{"role": "user", "content": "Navigate to google.com"}],
  "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
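
The curl call above checks tool calling but does not report speed. As a rough sketch, the same request can be sent from Python and timed. It assumes the server returns the usual OpenAI-compatible fields (`choices[0].message.tool_calls` and `usage.completion_tokens`). The helper names are hypothetical, and the wall-clock figure includes prompt processing, so it will understate pure generation speed.

```python
import json
import time
import urllib.request

URL = "http://localhost:8081/v1/chat/completions"  # same port as the server above


def build_payload(prompt):
    """Same request body as the curl example, expressed as a Python dict."""
    return {
        "model": "test",
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "browser_use",
                "description": "Browser",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "action": {"type": "string"},
                        "url": {"type": "string"},
                    },
                    "required": ["action"],
                },
            },
        }],
    }


def summarize(response, elapsed_s):
    """Pull tool calls and a rough tokens-per-second figure out of a response."""
    message = response["choices"][0]["message"]
    calls = [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in message.get("tool_calls") or []
    ]
    tokens = response.get("usage", {}).get("completion_tokens", 0)
    return {
        "tool_calls": calls,
        "tok_per_s": tokens / elapsed_s if elapsed_s > 0 else 0.0,
    }


def quick_test(prompt="Navigate to google.com", url=URL):
    """POST the request to a running llama-server and time the round trip."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.load(resp)
    return summarize(data, time.perf_counter() - start)


if __name__ == "__main__":
    print(quick_test())
```

An empty `tool_calls` list on this prompt suggests the model or its chat template is not emitting tool calls in the expected format.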

### Run Benchmark
```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```

## Models Tested

| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |

## Key Files

```
bench/
  run_benchmark.py       - Main benchmark runner
  tasks.json             - 6 test task definitions
  results/               - Raw results from all runs
reports/
  FINAL_Report.md        - Complete 5-axis analysis
  Multi_Axis_Analysis.md - Detailed breakdown per axis
  Model_Comparison.md    - Side-by-side tables
proxies/
  gemma4_proxy.py        - Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py          - LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py - Falcon Perception 3-layer adaptive pipeline
```

## Citation

If you use this benchmark, please cite:
```
@misc{small-llm-agent-bench-2026,
  title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
  author={Xavier},
  year={2026},
  url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```

## License

MIT