| # Small LLM Agent Benchmark |
|
|
| **Real-world browser agent benchmark for small language models on Apple Silicon (16GB)** |
|
|
| > BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes. |
|
|
| ## TL;DR Results |
|
|
| | Rank | Model | Score | Speed | Memory | Notes | |
| |------|-------|-------|-------|--------|-------| |
| | 🥇 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best | |
| | 🥈 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable | |
| | ✨ 3 | **LFM2-1.2B-Tool Q8_0 (slim)** | **4.5/6** | **76 tok/s** | **2.75 GB** | Efficiency king | |
| | 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | | |
| | 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | | |
| | 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! | |
| | 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | | |
| | 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | | |
| | 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here | |
| | 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training | |
| | 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small | |
| | 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop | |
| | 13 | Qwopus-27B Q3_K_S | OOM | N/A | 14+ GB | Doesn't fit 16GB | |
| |
| ## What This Benchmarks |
| |
| 6 real-world browser agent tasks, not synthetic function-call formatting tests: |
| |
| | # | Task | Difficulty | What it tests | |
| |---|------|-----------|---------------| |
| | T1 | Wikipedia info extraction | Easy | Navigate → extract → report | |
| | T2 | DuckDuckGo search | Medium | Navigate → type → click → read | |
| | T3 | Hacker News top story | Easy | Navigate → read → stop | |
| | T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report | |
| | T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit | |
| | T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click | |
|
|
| Each test requires **multi-step tool chaining**, not single-turn function call formatting. |
|
|
| ## 10 Counter-Intuitive Findings |
|
|
| 1. **BFCL ≠ Agent Capability**: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks |
| 2. **Higher Quant ≠ Better for MoE**: Gemma4: Q5 (5.0) > Q6 (4.5) > Q8 (4.0) |
| 3. **Higher Quant = Better for Dense**: Qwen: Q4 (3.5) < Q6 (5.0) |
| 4. **Uncensored ≠ Better Agent**: Quality gains come from quantization, not censoring |
| 5. **Faster Backend ≠ Better Results**: GGUF 24 tok/s beats MLX 35 tok/s (proxy issues) |
| 6. **197 tok/s Model Scores 0/6**: FunctionGemma is useless despite being fastest |
| 7. **4B MoE = 9B Dense**: Gemma4 E4B matches Qwen3.5-9B on agent tasks |
| 8. **1.2B Specialized > 8B Base**: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6) |
| 9. **The "Capability Cliff" Has Exceptions**: LFM2-1.2B-Tool breaks the 4B param rule |
| 10. **Small Models Are Context-Starved**: Reducing tools 26→8 pushed LFM2 from 4.0→4.5 |
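
Finding 10 credits the LFM2 gain to exposing fewer tools. The sketch below shows one way to trim a tool list before sending it to the model, assuming OpenAI-style tool schemas. `select_tools`, `schema_chars`, and the `tool_N` names are hypothetical placeholders, not the benchmark's actual code or tool set, and character count is only a rough stand-in for prompt tokens.

```python
import json

def select_tools(tools, allowed_names):
    """Keep only the tool schemas whose function name is in allowed_names."""
    allowed = set(allowed_names)
    return [t for t in tools if t["function"]["name"] in allowed]

def schema_chars(tools):
    """Rough proxy for prompt cost: characters in the serialized schemas."""
    return len(json.dumps(tools))

# Hypothetical catalogue of 26 tools (placeholder names, not the real tool list).
all_tools = [
    {
        "type": "function",
        "function": {
            "name": f"tool_{i}",
            "description": "placeholder",
            "parameters": {"type": "object", "properties": {}},
        },
    }
    for i in range(26)
]

# Expose only the 8 tools a task actually needs.
slim_tools = select_tools(all_tools, [f"tool_{i}" for i in range(8)])

print(len(all_tools), "->", len(slim_tools), "tools")
print(schema_chars(all_tools), "->", schema_chars(slim_tools), "schema characters")
```

Every schema dropped frees context for page content and tool results, which matters most for models running with small context windows.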
|
|
| ## 5-Axis Analysis |
|
|
| ### Axis 1: Model Family |
| - **Minimum ~4B active params** for multi-step agent tasks (with one exception) |
| - MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost |
| - Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training |
|
|
| ### Axis 2: Censoring |
| - Uncensored models show **no advantage** for tool-calling agent tasks |
| - Quality improvements are entirely from quantization level, not censoring |
|
|
| ### Axis 3: Quantization |
| - **MoE models**: Q5 is the sweet spot (speed > precision) |
| - **Dense models**: Q6 is the sweet spot (precision > speed) |
| - Never go below Q4 or above Q8 for agent tasks |
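
As a rough guide to how quantization level maps to the memory column in the results table, the weights of a model occupy about `parameters × bits_per_weight / 8` bytes. The sketch below applies that formula. The bits-per-weight figures are approximate averages for llama.cpp quant types and the 9B parameter count is taken from the model name, so treat both as assumptions. The estimate covers weights only, not the KV cache or runtime overhead.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate size of the quantized weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate average bits per weight (assumed values, not measured here).
APPROX_BPW = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

for quant, bpw in APPROX_BPW.items():
    size = weight_memory_gb(9e9, bpw)  # nominal 9B dense model
    print(f"9B dense at {quant}: ~{size:.1f} GB of weights")
```

Measured memory will be higher than these figures because the server also holds the context cache and, when used, the mmproj vision weights.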
|
|
| ### Axis 4: Backend |
| - **llama.cpp GGUF** is the universal winner: native tool calling, no proxy |
| - MLX is faster but needs a 7-fix proxy for LlmTornado compatibility |
| - Ollama has API format issues with Gemma4 |
|
|
| ### Axis 5: Vision |
| - **mmproj + Falcon Perception** together score 5.0/6 (best) |
| - Either alone scores 4.5/6 |
| - Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates |
|
|
| ## Hardware |
|
|
| - Mac Mini M4 16GB (Apple Silicon) |
| - macOS Darwin 24.3.0 |
| - llama.cpp b8640 (homebrew) |
| - Falcon Perception v2 (MLX backend) |
| - GUA_Blazor .NET 10 agent framework |
| |
| ## Architecture |
| |
| ``` |
| User Task → GUA_Blazor (agent loop, 25 turns) |
|   → LLM (llama.cpp, port 8081): reasoning + tool calling |
|   → Falcon Perception (MLX, port 8090): vision detection |
|   → Playwright Chromium: browser automation |
| ``` |
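
The control flow in the diagram can be sketched as a bounded loop: ask the model for the next action, run the matching tool, append the result, and stop on a final answer or when the 25-turn budget runs out. This is a simplified illustration, not the GUA_Blazor implementation (which is .NET). `run_agent`, `scripted_llm`, and the two stub tools are hypothetical stand-ins for the LLM and browser.

```python
MAX_TURNS = 25  # turn budget shown in the architecture diagram

def run_agent(task, llm_step, tools, max_turns=MAX_TURNS):
    """Ask the model for an action, execute it, feed the result back, repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = llm_step(history)
        if action["tool"] is None:
            # The model produced a final answer instead of a tool call.
            return action["content"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "name": action["tool"], "content": result})
    return None  # budget exhausted without a final answer

def scripted_llm(history):
    """Stand-in for the LLM: navigate, then extract, then report."""
    tool_turns = sum(1 for m in history if m["role"] == "tool")
    if tool_turns == 0:
        return {"tool": "navigate", "args": {"url": "https://example.com"}, "content": None}
    if tool_turns == 1:
        return {"tool": "extract", "args": {}, "content": None}
    return {"tool": None, "args": {}, "content": "Title: " + history[-1]["content"]}

# Stub tools standing in for the browser automation layer.
tools = {
    "navigate": lambda url: f"loaded {url}",
    "extract": lambda: "Example Domain",
}

answer = run_agent("Report the page title", scripted_llm, tools)
print(answer)
```

A model that keeps emitting tool calls without ever answering ends with `None` once the budget is spent, which is the failure mode the turn limit guards against.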
| |
| ## Run It Yourself |
| |
| ### Prerequisites |
| ```bash |
| # macOS with Apple Silicon |
| brew install llama.cpp |
| pip install falcon-perception # or clone github.com/tiiuae/falcon-perception |
| ``` |
| |
| ### Quick Speed Test |
| ```bash |
| # Download a model |
| huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \ |
| Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models |
|
|
| # Start server |
| llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \ |
| --port 8081 -ngl 99 -c 16384 |
|
|
| # Test tool calling |
| curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{ |
| "model": "test", |
| "stream": false, |
| "messages": [{"role": "user", "content": "Navigate to google.com"}], |
| "tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}] |
| }' |
| ``` |
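
The same check can be scripted. The sketch below builds the request shown above and pulls tool calls out of the reply, assuming the server returns the OpenAI-style `choices[0].message.tool_calls` shape. `build_payload`, `post_chat`, and `extract_tool_calls` are hypothetical helper names, and the sample response is hand-written for illustration rather than captured from a real run.

```python
import json
import urllib.request

def build_payload(prompt, tools, model="test"):
    """Mirror the JSON body used in the curl example."""
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }

def post_chat(payload, url="http://localhost:8081/v1/chat/completions"):
    """Send the payload to a running llama-server (requires the server to be up)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # arguments usually arrive as a JSON string
            args = json.loads(args)
        calls.append((fn["name"], args))
    return calls

# Hand-written sample reply, so the parser can be exercised without a server.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "browser_use",
                    "arguments": "{\"action\": \"navigate\", \"url\": \"https://google.com\"}",
                },
            }],
        }
    }]
}

print(extract_tool_calls(sample))
```

An empty list from `extract_tool_calls` means the model replied in plain text instead of calling the tool.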
| |
| ### Run Benchmark |
| ```bash |
| python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf |
| ``` |
| |
| ## Models Tested |
| |
| | Model | HuggingFace | Backend | mmproj? | |
| |-------|-------------|---------|---------| |
| | Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) | |
| | Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) | |
| | Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes | |
| | LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) | |
| | Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No | |
| | Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native | |
| | LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No | |
| | FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No | |
| |
| ## Key Files |
| |
| ``` |
| bench/ |
|   run_benchmark.py # Main benchmark runner |
|   tasks.json # 6 test task definitions |
|   results/ # Raw results from all runs |
| reports/ |
|   FINAL_Report.md # Complete 5-axis analysis |
|   Multi_Axis_Analysis.md # Detailed breakdown per axis |
|   Model_Comparison.md # Side-by-side tables |
| proxies/ |
|   gemma4_proxy.py # Gemma4 MLX → LlmTornado proxy (7 fixes) |
|   lfm2_proxy.py # LFM2 pythonic tool-call proxy |
| vision/ |
|   falcon_vision_server.py # Falcon Perception 3-layer adaptive pipeline |
| ``` |
| |
| ## Citation |
| |
| If you use this benchmark, please cite: |
| ``` |
| @misc{small-llm-agent-bench-2026, |
| title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon}, |
| author={Xavier}, |
| year={2026}, |
| url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models} |
| } |
| ``` |
| |
| ## License |
| |
| MIT |
| |