# Small LLM Agent Benchmark

**Real-world browser agent benchmark for small language models on Apple Silicon (16GB)**

> BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

## TL;DR Results

| Rank | Model | Score | Speed | Memory | Notes |
|------|-------|-------|-------|--------|-------|
| 🏆 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
| 🏆 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
| ✨ 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king |
| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
| 13 | Qwopus-27B Q3_K_S | OOM | n/a | 14+ GB | Doesn't fit in 16GB |

## What This Benchmarks

6 real-world browser agent tasks, not synthetic function-call formatting tests:

| # | Task | Difficulty | What it tests |
|---|------|-----------|---------------|
| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
| T3 | Hacker News top story | Easy | Navigate → read → stop |
| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |

Each test requires **multi-step tool chaining**, not single-turn function call formatting.

## 10 Counter-Intuitive Findings

1. **BFCL ≠ Agent Capability**: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
2. **Higher Quant ≠ Better for MoE**: Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
3. **Higher Quant = Better for Dense**: Qwen: Q4 (3.5) < Q6 (5.0)
4. **Uncensored ≠ Better Agent**: Quality gains come from quantization, not from uncensoring
5. **Faster Backend ≠ Better Results**: GGUF 24 tok/s beats MLX 35 tok/s (proxy issues)
6. **197 tok/s Model Scores 0/6**: FunctionGemma is useless despite being fastest
7. **4B MoE = 9B Dense**: Gemma4 E4B matches Qwen3.5-9B on agent tasks
8. **1.2B Specialized > 8B Base**: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
9. **The "Capability Cliff" Has Exceptions**: LFM2-1.2B-Tool breaks the 4B param rule
10. **Small Models Are Context-Starved**: Reducing tools 26→8 pushed LFM2 from 4.0→4.5

## 5-Axis Analysis

### Axis 1: Model Family

- **Minimum ~4B active params** for multi-step agent tasks (with one exception)
- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

### Axis 2: Censoring

- Uncensored models show **no advantage** for tool-calling agent tasks
- Quality improvements come entirely from quantization level, not from uncensoring

### Axis 3: Quantization

- **MoE models**: Q5 is the sweet spot (speed > precision)
- **Dense models**: Q6 is the sweet spot (precision > speed)
- Never go below Q4 or above Q8 for agent tasks

### Axis 4: Backend

- **llama.cpp GGUF** is the universal winner: native tool calling, no proxy
- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
- Ollama has API format issues with Gemma4

### Axis 5: Vision

- **mmproj + Falcon Perception** together score 5.0/6 (best)
- Either alone scores 4.5/6
- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

## Hardware

- Mac Mini M4 16GB (Apple Silicon)
- macOS Darwin 24.3.0
- llama.cpp b8640 (homebrew)
- Falcon Perception v2 (MLX backend)
- GUA_Blazor .NET 10 agent framework

## Architecture

```
User Task → GUA_Blazor (agent loop, 25 turns)
  → LLM (llama.cpp, port 8081): reasoning + tool calling
  → Falcon Perception (MLX, port 8090): vision detection
  → Playwright Chromium: browser automation
```

## Run It Yourself

### Prerequisites

```bash
# macOS with Apple Silicon
brew install llama.cpp
pip install falcon-perception  # or clone github.com/tiiuae/falcon-perception
```

### Quick Speed Test

```bash
# Download a model
huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
  Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

# Start server (leave it running; run the curl test from a second terminal)
llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
  --port 8081 -ngl 99 -c 16384

# Test tool calling
curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "test",
  "stream": false,
  "messages": [{"role": "user", "content": "Navigate to google.com"}],
  "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
}'
```

### Run Benchmark

```bash
python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
```

## Models Tested

| Model | HuggingFace | Backend | mmproj? |
|-------|-------------|---------|---------|
| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |

## Key Files

```
bench/
  run_benchmark.py         - Main benchmark runner
  tasks.json               - 6 test task definitions
  results/                 - Raw results from all runs
reports/
  FINAL_Report.md          - Complete 5-axis analysis
  Multi_Axis_Analysis.md   - Detailed breakdown per axis
  Model_Comparison.md      - Side-by-side tables
proxies/
  gemma4_proxy.py          - Gemma4 MLX → LlmTornado proxy (7 fixes)
  lfm2_proxy.py            - LFM2 pythonic tool-call proxy
vision/
  falcon_vision_server.py  - Falcon Perception 3-layer adaptive pipeline
```

## Citation

If you use this benchmark, please cite:

```
@misc{small-llm-agent-bench-2026,
  title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
  author={Xavier},
  year={2026},
  url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
}
```

## License

MIT
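
## Appendix: Checking a Tool-Call Response

The curl request in the Quick Speed Test returns JSON, and whether the model produced a usable tool call is easier to judge programmatically than by eye. The sketch below is not taken from `bench/run_benchmark.py`. It assumes the server replies in the OpenAI-style chat-completion shape (`choices[0].message.tool_calls`, with `function.arguments` encoded as a JSON string). The helper name `extract_tool_calls` and the `sample` payload are hypothetical and exist only for illustration.

```python
import json
from typing import Any


def extract_tool_calls(response: dict[str, Any]) -> list[tuple[str, dict[str, Any]]]:
    """Return (tool_name, arguments) pairs from an OpenAI-style chat completion."""
    choices = response.get("choices") or []
    if not choices:
        return []
    message = choices[0].get("message") or {}
    calls: list[tuple[str, dict[str, Any]]] = []
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        raw_args = function.get("arguments") or "{}"
        # Assumption: arguments arrive as a JSON string; tolerate a dict as well.
        if isinstance(raw_args, str):
            try:
                args = json.loads(raw_args)
            except json.JSONDecodeError:
                continue  # malformed arguments: skip this call
        else:
            args = raw_args
        if not isinstance(args, dict):
            continue  # arguments must decode to a JSON object
        calls.append((function.get("name", ""), args))
    return calls


# Hypothetical response, shaped like a reply to the curl request above.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": '{"action": "navigate", "url": "https://google.com"}',
                },
            }],
        },
    }],
}

print(extract_tool_calls(sample))
```

An empty list means the model answered in plain text or emitted arguments that do not parse, which is a quick signal that a model or backend may need a proxy before it can drive the agent loop.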