# Small LLM Agent Benchmark
**Real-world browser agent benchmark for small language models on Apple Silicon (16GB)**
> BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.
## TL;DR Results
| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | n/a | 14+ GB | Doesn't fit 16GB |
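
To make the "Efficiency king" note concrete, here is a small sketch that ranks a few rows of the table by score per GB of memory. The points-per-GB metric is an illustrative assumption, not something the benchmark itself reports; the numbers are copied from the table above.

```python
# Rows copied from the results table: (model, score out of 6, tok/s, memory in GB).
RESULTS = [
    ("Gemma4 E4B Uncensored Q5_K_P", 5.0, 24.5, 6.3),
    ("Qwen3.5-9B Uncensored Q6_K", 5.0, 13.5, 7.8),
    ("LFM2-1.2B-Tool Q8_0 (slim)", 4.5, 76.0, 2.75),
    ("Bonsai-8B 1-bit", 1.0, 48.8, 1.5),
]


def score_per_gb(score, memory_gb):
    """Illustrative efficiency metric: benchmark points per GB of memory."""
    return score / memory_gb


# Sort from most to least memory-efficient.
ranked = sorted(RESULTS, key=lambda r: score_per_gb(r[1], r[3]), reverse=True)

for model, score, tok_s, mem in ranked:
    print(f"{model:30s} {score_per_gb(score, mem):.2f} points/GB  {tok_s:.1f} tok/s")
```

By this measure LFM2-1.2B-Tool comes out well ahead of the two 5.0/6 models, which is what the table's note suggests.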
## What This Benchmarks
6 real-world browser agent tasks, not synthetic function-call formatting tests:
| # | Task | Difficulty | What it tests |
|---|------|-----------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |
Each test requires **multi-step tool chaining**, not single-turn function-call formatting.
## 10 Counter-Intuitive Findings
1. **BFCL ≠ Agent Capability**: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
2. **Higher Quant ≠ Better for MoE**: Gemma4 scores Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
3. **Higher Quant = Better for Dense**: Qwen scores Q4 (3.5) < Q6 (5.0)
4. **Uncensored ≠ Better Agent**: Score differences track quantization level, not whether the model is uncensored
5. **Faster Backend ≠ Better Results**: GGUF at 24 tok/s beats MLX at 35 tok/s (proxy issues)
6. **197 tok/s Model Scores 0/6**: FunctionGemma is the fastest model tested yet completes no task (it gets stuck in an infinite loop)
7. **4B MoE = 9B Dense**: Gemma4 E4B matches Qwen3.5-9B on agent tasks
8. **1.2B Specialized > 8B Base**: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
9. **The "Capability Cliff" Has Exceptions**: LFM2-1.2B-Tool breaks the ~4B active-parameter rule
10. **Small Models Are Context-Starved**: Reducing the tool list from 26 to 8 tools raised LFM2 from 4.0 to 4.5
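
The tool-list reduction in the last finding can be sketched as a simple whitelist filter over OpenAI-style tool schemas. This is a minimal sketch, not the benchmark's actual code: `slim_tools`, `file_read`, and `shell_exec` are hypothetical names; only `browser_use` and `vision_detect` appear in this README.

```python
import json


def slim_tools(tools, allowed_names):
    """Keep only the tool schemas whose function name is in allowed_names."""
    allowed = set(allowed_names)
    return [t for t in tools if t["function"]["name"] in allowed]


def make_tool(name):
    """Build a placeholder OpenAI-style tool schema (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Placeholder description for {name}",
            "parameters": {"type": "object", "properties": {}},
        },
    }


# Hypothetical full tool list; a real agent would have many more entries.
full_tools = [make_tool(n) for n in ["browser_use", "vision_detect", "file_read", "shell_exec"]]
slim = slim_tools(full_tools, ["browser_use", "vision_detect"])

# Serialized schema length is a rough proxy for the prompt space the tools consume.
print(len(json.dumps(full_tools)), "->", len(json.dumps(slim)), "characters of tool schema")
print([t["function"]["name"] for t in slim])
```

Every schema that is dropped frees context for page content and conversation history, which matters most for models with small context windows.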
## 5-Axis Analysis
### Axis 1: Model Family
- **Minimum ~4B active params** for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training
### Axis 2: Censoring
- Uncensored models show **no advantage** for tool-calling agent tasks
- Score differences come from the quantization level, not from whether the model is uncensored
### Axis 3: Quantization
- **MoE models**: Q5 is the sweet spot (speed > precision)
- **Dense models**: Q6 is the sweet spot (precision > speed)
- Never go below Q4 or above Q8 for agent tasks
### Axis 4: Backend
- **llama.cpp GGUF** is the universal winner: native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4
### Axis 5: Vision
- **mmproj + Falcon Perception** together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates
## Hardware
- Mac Mini M4 16GB (Apple Silicon)
- macOS Darwin 24.3.0
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework
## Architecture
```
User Task → GUA_Blazor (agent loop, 25 turns)
    → LLM (llama.cpp, port 8081): reasoning + tool calling
    → Falcon Perception (MLX, port 8090): vision detection
    → Playwright Chromium: browser automation
```
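
The agent loop in the diagram can be approximated in a few lines. GUA_Blazor itself is a .NET framework and its code is not shown here, so the following Python sketch is only an assumption about the general shape of such a loop: `run_agent`, `make_scripted_llm`, and the stub tool are hypothetical, and the LLM is replaced by scripted replies so the example runs without a server.

```python
import json

MAX_TURNS = 25  # turn budget taken from the diagram


def run_agent(task, call_llm, tools):
    """Loop until the model answers without requesting a tool, or the budget runs out."""
    messages = [{"role": "user", "content": task}]
    for turn in range(MAX_TURNS):
        reply = call_llm(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool requested: treat the text content as the final answer.
            return reply.get("content"), turn + 1
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"])
            result = tools[name](**args)
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": str(result)})
    # Budget exhausted, e.g. a model that keeps calling tools forever.
    return None, MAX_TURNS


def make_scripted_llm(replies):
    """Stand-in for the real LLM: returns pre-written replies in order."""
    iterator = iter(replies)
    return lambda messages: next(iterator)


scripted = make_scripted_llm([
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "browser_use",
                     "arguments": '{"action": "navigate", "url": "https://news.ycombinator.com"}'},
    }]},
    {"role": "assistant", "content": "Top story found."},
])
stub_tools = {"browser_use": lambda action, url=None: f"{action} ok"}

answer, turns = run_agent("Find the top Hacker News story", scripted, stub_tools)
print(answer, turns)
```

The fixed turn budget is what turns an endlessly looping model into a bounded failure rather than a hung run.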
## Run It Yourself
### Prerequisites
```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception # or clone github.com/tiiuae/falcon-perception
```
### Quick Speed Test
```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models
# Start server (runs in the foreground; run the test below from a second terminal)
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384
# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "test",
"stream": false,
"messages": [{"role": "user", "content": "Navigate to google.com"}],
"tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
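
The same check can be scripted in Python with only the standard library. This is a sketch under the assumption that the server returns the usual OpenAI-style chat completion shape (`choices[0].message.tool_calls`); `request_tool_call` and `extract_tool_calls` are hypothetical helper names, and the sample reply is hand-written rather than real server output.

```python
import json
import urllib.request

# Same tool schema as the curl example.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "browser_use",
        "description": "Browser",
        "parameters": {
            "type": "object",
            "properties": {"action": {"type": "string"}, "url": {"type": "string"}},
            "required": ["action"],
        },
    },
}]


def request_tool_call(prompt, base_url="http://localhost:8081"):
    """POST the same payload as the curl example and return the parsed JSON reply."""
    payload = {
        "model": "test",
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def extract_tool_calls(response):
    """Return (name, arguments dict) pairs from an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


# Hand-written sample reply, used so the parsing logic runs without a server.
sample = {"choices": [{"message": {"role": "assistant", "content": None, "tool_calls": [{
    "id": "call_0", "type": "function",
    "function": {"name": "browser_use",
                 "arguments": '{"action": "navigate", "url": "https://google.com"}'},
}]}}]}
print(extract_tool_calls(sample))
```

With the server running, `extract_tool_calls(request_tool_call("Navigate to google.com"))` performs the live check; an empty list means the model replied in plain text instead of calling the tool.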
### Run Benchmark
```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```
## Models Tested
| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |
## Key Files
```
bench/
  run_benchmark.py           # Main benchmark runner
  tasks.json                 # 6 test task definitions
  results/                   # Raw results from all runs
reports/
  FINAL_Report.md            # Complete 5-axis analysis
  Multi_Axis_Analysis.md     # Detailed breakdown per axis
  Model_Comparison.md        # Side-by-side tables
proxies/
  gemma4_proxy.py            # Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py              # LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py    # Falcon Perception 3-layer adaptive pipeline
```
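
For readers wondering what a "pythonic tool-call proxy" has to do, here is a minimal sketch of the core conversion. It is not the code in `lfm2_proxy.py`, which is not shown in this README; it assumes the model emits calls as a Python-style list such as `[browser_use(action="navigate", url="...")]` with keyword arguments only, and `parse_pythonic_calls` is a hypothetical name.

```python
import ast
import json


def parse_pythonic_calls(text):
    """Convert '[f(a="x"), g(b=1)]' into OpenAI-style tool_call dicts.

    Uses the ast module so the model's output is parsed, never executed.
    """
    tree = ast.parse(text.strip(), mode="eval")
    nodes = tree.body.elts if isinstance(tree.body, ast.List) else [tree.body]
    calls = []
    for index, node in enumerate(nodes):
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected a plain function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported in this sketch")
        # literal_eval accepts AST nodes and only allows literal values.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({
            "id": f"call_{index}",
            "type": "function",
            "function": {"name": node.func.id, "arguments": json.dumps(arguments)},
        })
    return calls


calls = parse_pythonic_calls('[browser_use(action="navigate", url="https://example.com")]')
print(json.dumps(calls, indent=2))
```

A real proxy would also need to strip any special tokens wrapped around the call list and pass plain-text replies through untouched.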
## Citation
If you use this benchmark, please cite:
```
@misc{small-llm-agent-bench-2026,
title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
author={Xavier},
year={2026},
url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```
## License
MIT