# Small LLM Agent Benchmark

**Real-world browser agent benchmark for small language models on Apple Silicon (16GB)**

> BFCL (the Berkeley Function-Calling Leaderboard) rates Bonsai-8B the best tool-caller at 73%. On our benchmark it scores 1/6 on actual agent tasks, while a 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

## TL;DR Results

| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| 🏆 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 🏆 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher-bit quant scored lower |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | — | 14+ GB | Doesn't fit 16GB |
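
The "Efficiency king" label can be made concrete by dividing score by memory footprint. The sketch below uses four rows copied from the table above; `score_per_gb` is an illustrative ratio introduced here, not a metric the benchmark itself reports.

```python
# Rows copied from the results table: (model, score out of 6, tok/s, memory in GB).
ROWS = [
    ("Gemma4 E4B Uncensored Q5_K_P", 5.0, 24.5, 6.3),
    ("Qwen3.5-9B Uncensored Q6_K", 5.0, 13.5, 7.8),
    ("LFM2-1.2B-Tool Q8_0 (slim)", 4.5, 76.0, 2.75),
    ("Bonsai-8B 1-bit", 1.0, 48.8, 1.5),
]


def score_per_gb(row):
    """Illustrative ratio: task score divided by memory footprint in GB."""
    _, score, _, mem_gb = row
    return score / mem_gb


# Highest points-per-GB first.
ranked = sorted(ROWS, key=score_per_gb, reverse=True)
for name, score, tps, mem_gb in ranked:
    print(f"{name:32s} {score / mem_gb:.2f} points/GB at {tps} tok/s")
```

A low memory footprint alone does not help: Bonsai-8B uses the least memory of the four but still ranks below Gemma4 on this ratio because its score is so low.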

## What This Benchmarks

Six real-world browser agent tasks, not synthetic function-call formatting tests:

| # | Task | Difficulty | What it tests |
|---|------|-----------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |

Each test requires **multi-step tool chaining** — not single-turn function call formatting.

## 10 Counter-Intuitive Findings

1. **BFCL ≠ Agent Capability** — Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
2. **Higher Quant ≠ Better for MoE** — Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
3. **Higher Quant = Better for Dense** — Qwen: Q4 (3.5) < Q6 (5.0)
4. **Uncensored ≠ Better Agent** — Quality gains come from quantization, not censoring
5. **Faster Backend ≠ Better Results** — GGUF 24 tok/s beats MLX 35 tok/s (proxy issues)
6. **197 tok/s Model Scores 0/6** — FunctionGemma is useless despite being fastest
7. **4B MoE = 9B Dense** — Gemma4 E4B matches Qwen3.5-9B on agent tasks
8. **1.2B Specialized > 8B Base** — LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
9. **The "Capability Cliff" Has Exceptions** — LFM2-1.2B-Tool breaks the 4B param rule
10. **Small Models Are Context-Starved** — Reducing tools 26→8 pushed LFM2 from 4.0→4.5
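
Finding 10 suggests trimming the tool list before each request. Here is a minimal sketch, assuming tools are passed as OpenAI-style function schemas. Only `browser_use` appears in this README; the `tool_N` names and the `select_tools` helper are hypothetical placeholders, not the benchmark's actual tool catalogue.

```python
import json


def make_tool(name, description):
    """Build a minimal OpenAI-style function schema (illustrative only)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}},
        },
    }


def select_tools(all_tools, allowed_names):
    """Keep only the tools whose function name is in allowed_names."""
    return [t for t in all_tools if t["function"]["name"] in allowed_names]


# Hypothetical catalogue of 26 tools; the placeholder names are assumptions.
all_tools = [make_tool("browser_use", "Browser")] + [
    make_tool(f"tool_{i}", f"Placeholder tool {i}") for i in range(1, 26)
]

# Keep 8 tools: browser_use plus seven placeholders.
slim = select_tools(all_tools, {"browser_use"} | {f"tool_{i}" for i in range(1, 8)})

# Serialized schema size is a rough proxy for the prompt budget the tools consume.
print(len(all_tools), len(json.dumps(all_tools)))
print(len(slim), len(json.dumps(slim)))
```

Every schema sent in `tools` is rendered into the prompt, so a shorter list leaves more of a small model's context window for page content and conversation history.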

## 5-Axis Analysis

### Axis 1: Model Family
- **Minimum ~4B active params** for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

### Axis 2: Censoring
- Uncensored variants show **no advantage** over base models for tool-calling agent tasks
- Score differences track quantization level, not whether the model is uncensored

### Axis 3: Quantization
- **MoE models**: Q5 is the sweet spot (speed matters more than precision)
- **Dense models**: Q6 is the sweet spot (precision matters more than speed)
- Never go below Q4 or above Q8 for agent tasks

### Axis 4: Backend
- **llama.cpp GGUF** is the universal winner — native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4

### Axis 5: Vision
- **mmproj + Falcon Perception** together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

## Hardware

- Mac Mini M4 16GB (Apple Silicon)
- macOS (Darwin kernel 24.3.0)
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework

## Architecture

```
User Task → GUA_Blazor (agent loop, 25 turns)
  → LLM (llama.cpp, port 8081) — reasoning + tool calling
  → Falcon Perception (MLX, port 8090) — vision detection
  → Playwright Chromium — browser automation
```
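
The loop in the diagram can be sketched in Python as follows. This is an illustration only: the real loop lives in GUA_Blazor (.NET), and `run_agent`, `call_llm`, `execute_tool`, and the `"navigate"` action value are hypothetical names chosen for this example. Messages follow the OpenAI chat format, with tool results sent back as `role: "tool"` messages.

```python
import json

MAX_TURNS = 25  # turn budget from the diagram above


def run_agent(task, call_llm, execute_tool, max_turns=MAX_TURNS):
    """Alternate between model and tools until the model replies without tool calls."""
    messages = [{"role": "user", "content": task}]
    for turn in range(1, max_turns + 1):
        reply = call_llm(messages)  # assistant message as a dict
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content"), turn  # final answer
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"])
            result = execute_tool(call["function"]["name"], args)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    return None, max_turns  # budget exhausted, e.g. a model stuck in a loop


# Stand-ins so the sketch runs without a server or a browser.
def fake_llm(messages):
    if messages[-1]["role"] == "user":
        arguments = json.dumps({"action": "navigate", "url": "https://news.ycombinator.com"})
        return {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "browser_use", "arguments": arguments},
        }]}
    return {"role": "assistant", "content": "Top story read."}


def fake_tool(name, args):
    return {"ok": True, "tool": name, "action": args["action"]}


answer, turns = run_agent("Read the top Hacker News story", fake_llm, fake_tool)
print(answer, turns)
```

The turn cap is what turns a model that never stops calling tools into a failed task instead of a hung run.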

## Run It Yourself

### Prerequisites
```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception  # or clone github.com/tiiuae/falcon-perception
```

### Quick Speed Test
```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384

# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "test",
  "stream": false,
  "messages": [{"role": "user", "content": "Navigate to google.com"}],
  "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
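
To check the response programmatically, look for `tool_calls` on the first choice. This is a minimal sketch, assuming the server returns the OpenAI-compatible chat completion shape. `post_chat` and `extract_tool_calls` are hypothetical helper names, and the sample payload is hand-written for illustration, not captured model output.

```python
import json
import urllib.request


def post_chat(payload, url="http://localhost:8081/v1/chat/completions"):
    """POST a chat completion request to the local server and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def extract_tool_calls(response):
    """Return (name, arguments dict) pairs from an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written example payload (not real model output).
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": json.dumps({"action": "navigate", "url": "https://google.com"}),
                },
            }],
        }
    }]
}

print(extract_tool_calls(sample))
```

An empty list means the model answered in plain text instead of issuing a structured tool call, so tool calling did not work for that request.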

### Run Benchmark
```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```

## Models Tested

| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |

## Key Files

```
bench/
  run_benchmark.py       — Main benchmark runner
  tasks.json             — 6 test task definitions
  results/               — Raw results from all runs
reports/
  FINAL_Report.md        — Complete 5-axis analysis
  Multi_Axis_Analysis.md — Detailed breakdown per axis
  Model_Comparison.md    — Side-by-side tables
proxies/
  gemma4_proxy.py        — Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py          — LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py — Falcon Perception 3-layer adaptive pipeline
```

## Citation

If you use this benchmark, please cite:
```bibtex
@misc{small-llm-agent-bench-2026,
  title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
  author={Xavier},
  year={2026},
  url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```

## License

MIT