# Small LLM Agent Benchmark
**Real-world browser agent benchmark for small language models on Apple Silicon (16GB)**
> BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.
## TL;DR Results
| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| 🏆 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 🏆 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | n/a | 14+ GB | Doesn't fit in 16 GB |
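
The "efficiency king" label can be checked with simple arithmetic. The sketch below uses a score-per-GB ratio, which is an illustrative metric assumed here and not one defined by the benchmark, on four rows copied from the table:

```python
# Rows copied from the results table: (model, score out of 6, tok/s, memory in GB).
RESULTS = [
    ("Gemma4 E4B Uncensored Q5_K_P", 5.0, 24.5, 6.3),
    ("Qwen3.5-9B Uncensored Q6_K", 5.0, 13.5, 7.8),
    ("LFM2-1.2B-Tool Q8_0 (slim)", 4.5, 76.0, 2.75),
    ("Bonsai-8B 1-bit", 1.0, 48.8, 1.5),
]


def score_per_gb(row):
    """Illustrative metric (an assumption, not part of the benchmark): score per GB."""
    _, score, _, mem_gb = row
    return score / mem_gb


# Sort from most to least task score per GB of memory.
ranked = sorted(RESULTS, key=score_per_gb, reverse=True)
for row in ranked:
    print(f"{row[0]:32s} {score_per_gb(row):.2f} points/GB")
```

By this ratio LFM2-1.2B-Tool comes out well ahead of the two 5.0/6 models, because it reaches 4.5/6 in less than half their memory.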
## What This Benchmarks
6 real-world browser agent tasks, not synthetic function-call formatting tests:
| # | Task | Difficulty | What it tests |
|---|------|-----------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |
Each test requires **multi-step tool chaining**, not single-turn function call formatting.
## 10 Counter-Intuitive Findings
1. **BFCL ≠ Agent Capability**: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
2. **Higher Quant ≠ Better for MoE**: Gemma4 scores Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
3. **Higher Quant = Better for Dense**: Qwen scores Q4 (3.5) < Q6 (5.0)
4. **Uncensored ≠ Better Agent**: Quality gains come from the quantization level, not from uncensoring
5. **Faster Backend ≠ Better Results**: GGUF at 24 tok/s beats MLX at 35 tok/s (proxy issues)
6. **197 tok/s Model Scores 0/6**: FunctionGemma is useless despite being the fastest
7. **4B MoE = 9B Dense**: Gemma4 E4B matches Qwen3.5-9B on agent tasks
8. **1.2B Specialized > 8B Base**: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
9. **The "Capability Cliff" Has Exceptions**: LFM2-1.2B-Tool breaks the 4B param rule
10. **Small Models Are Context-Starved**: Reducing the tool list from 26 to 8 raised LFM2 from 4.0 to 4.5
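
Finding 10 comes down to sending fewer tool schemas per request. A minimal sketch of that idea follows, assuming OpenAI-style tool definitions. The tool names and the `make_tool` and `slim_tools` helpers are hypothetical, and serialized length is only a rough proxy for prompt tokens:

```python
import json


def make_tool(name):
    """Build a minimal OpenAI-style function tool schema (hypothetical tool names)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"Hypothetical tool {name}",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    }


def slim_tools(tools, allowed):
    """Keep only the tools whose names are in the allow-list, preserving order."""
    return [t for t in tools if t["function"]["name"] in allowed]


# 26 placeholder tools reduced to 8, mirroring the counts in finding 10.
all_tools = [make_tool(f"tool_{i}") for i in range(26)]
allowed = {f"tool_{i}" for i in range(8)}
slim = slim_tools(all_tools, allowed)

# Serialized size is a rough stand-in for how much context the schemas consume.
full_chars = len(json.dumps(all_tools))
slim_chars = len(json.dumps(slim))
print(len(all_tools), "->", len(slim), "tools;", full_chars, "->", slim_chars, "chars")
```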
## 5-Axis Analysis
### Axis 1: Model Family
- **Minimum ~4B active params** for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training
### Axis 2: Censoring
- Uncensored models show **no advantage** for tool-calling agent tasks
- Quality differences track the quantization level, not whether the model is uncensored
### Axis 3: Quantization
- **MoE models**: Q5 is the sweet spot (speed > precision)
- **Dense models**: Q6 is the sweet spot (precision > speed)
- Never go below Q4 or above Q8 for agent tasks
### Axis 4: Backend
- **llama.cpp GGUF** is the universal winner: native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4
### Axis 5: Vision
- **mmproj + Falcon Perception** together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates
## Hardware
- Mac Mini M4 16GB (Apple Silicon)
- macOS Darwin 24.3.0
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework
## Architecture
```
User Task → GUA_Blazor (agent loop, 25 turns)
  → LLM (llama.cpp, port 8081): reasoning + tool calling
  → Falcon Perception (MLX, port 8090): vision detection
  → Playwright Chromium: browser automation
```
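
The agent loop in the diagram can be sketched roughly as follows. This is not the GUA_Blazor implementation (which is .NET), only a Python illustration of a turn-capped tool loop. The `call_model` interface, the `navigate` tool, and the repeated-call check are assumptions made for the example:

```python
import json

MAX_TURNS = 25  # turn cap taken from the diagram above


def run_agent(task, call_model, tools, max_turns=MAX_TURNS):
    """Ask the model, run any requested tool, feed the result back, repeat.

    call_model(messages) returns a dict with either "content" (final answer)
    or "tool_call" ({"name": ..., "arguments": {...}}). Hypothetical interface.
    """
    messages = [{"role": "user", "content": task}]
    last_call = None
    for turn in range(1, max_turns + 1):
        reply = call_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return {"status": "done", "turns": turn, "answer": reply.get("content", "")}
        key = json.dumps(call, sort_keys=True)
        if key == last_call:
            # Same call twice in a row: treat it as a stuck loop and stop early.
            return {"status": "loop", "turns": turn, "answer": ""}
        last_call = key
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "assistant", "content": "", "tool_call": call})
        messages.append({"role": "tool", "name": call["name"], "content": str(result)})
    return {"status": "max_turns", "turns": max_turns, "answer": ""}


def fake_model(messages):
    """Stub model: navigate once, then answer using the tool result."""
    if messages[-1]["role"] == "tool":
        return {"content": "Title: " + messages[-1]["content"]}
    return {"tool_call": {"name": "navigate", "arguments": {"url": "https://example.com"}}}


tools = {"navigate": lambda url: f"loaded {url}"}
outcome = run_agent("Open example.com and report the title", fake_model, tools)
print(outcome)
```

The early exit on a repeated call is one simple guard against the kind of infinite loop noted for FunctionGemma in the results table. A real agent would likely need a more tolerant check.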
## Run It Yourself
### Prerequisites
```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception # or clone github.com/tiiuae/falcon-perception
```
### Quick Speed Test
```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models
# Start server
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
--port 8081 -ngl 99 -c 16384
# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "test",
"stream": false,
"messages": [{"role": "user", "content": "Navigate to google.com"}],
"tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```
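
The same check can be scripted in Python. The sketch below rebuilds the curl payload and extracts tool calls from a response, assuming the server returns the OpenAI-style `choices[0].message.tool_calls` shape. The helper names are hypothetical, and `sample` is a hand-written response rather than captured server output:

```python
import json
import urllib.request

URL = "http://localhost:8081/v1/chat/completions"  # port from the llama-server command above

BROWSER_TOOL = {
    "type": "function",
    "function": {
        "name": "browser_use",
        "description": "Browser",
        "parameters": {
            "type": "object",
            "properties": {"action": {"type": "string"}, "url": {"type": "string"}},
            "required": ["action"],
        },
    },
}


def build_payload(prompt):
    """Build the same request body as the curl example."""
    return {
        "model": "test",
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [BROWSER_TOOL],
    }


def post_chat(payload, url=URL, timeout=120):
    """POST the payload to a running llama-server and return the decoded JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_tool_calls(response):
    """Return (name, arguments dict) pairs from an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written sample in the assumed response shape, not real server output.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": '{"action": "navigate", "url": "https://google.com"}',
                },
            }],
        }
    }]
}
print(extract_tool_calls(sample))
```

With the server running, `extract_tool_calls(post_chat(build_payload("Navigate to google.com")))` would perform the live check. An empty list means the model answered in plain text instead of calling the tool.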
### Run Benchmark
```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```
## Models Tested
| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |
## Key Files
```
bench/
  run_benchmark.py          - Main benchmark runner
  tasks.json                - 6 test task definitions
  results/                  - Raw results from all runs
reports/
  FINAL_Report.md           - Complete 5-axis analysis
  Multi_Axis_Analysis.md    - Detailed breakdown per axis
  Model_Comparison.md       - Side-by-side tables
proxies/
  gemma4_proxy.py           - Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py             - LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py   - Falcon Perception 3-layer adaptive pipeline
```
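
To illustrate what a pythonic tool-call proxy has to do, here is a minimal sketch that converts a Python-style call list into OpenAI-style `tool_calls` entries. It is not the code in `lfm2_proxy.py`. The input format shown in `raw` and the function name `parse_pythonic_calls` are assumptions for the example:

```python
import ast
import json


def parse_pythonic_calls(text):
    """Parse '[name(key="value", ...)]' into OpenAI-style tool_call dicts.

    Only keyword arguments with literal values are accepted.
    """
    tree = ast.parse(text.strip(), mode="eval")
    # Accept either a list of calls or a single bare call.
    nodes = tree.body.elts if isinstance(tree.body, ast.List) else [tree.body]
    calls = []
    for node in nodes:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected a plain function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        # literal_eval rejects anything that is not a plain literal value.
        args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({
            "type": "function",
            "function": {"name": node.func.id, "arguments": json.dumps(args)},
        })
    return calls


raw = '[browser_use(action="navigate", url="https://example.com")]'
parsed = parse_pythonic_calls(raw)
print(parsed)
```

Parsing with `ast` instead of `eval` means the model output is never executed, which matters when the text comes from an untrusted generation.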
## Citation
If you use this benchmark, please cite:
```
@misc{small-llm-agent-bench-2026,
title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
author={Xavier},
year={2026},
url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```
## License
MIT