	# Small LLM Agent Benchmark

	Real-world browser agent benchmark for small language models on Apple Silicon (16GB)

	> BFCL says Bonsai-8B is the best tool-caller at 73%. Our benchmark says it scores 1/6 on actual agent tasks. A 1.2B model scores 4.5/6. Here's what we found testing 15+ model configurations across 5 axes.

	## TL;DR Results

	| Rank | Model | Score | Speed | Memory | Notes |
	|------|-------|-------|-------|--------|-------|
	| 🏆 1 | Gemma4 E4B Uncensored Q5_K_P | 5.0/6 | 24.5 tok/s | 6.3 GB | Overall best |
	| 🏆 2 | Qwen3.5-9B Uncensored Q6_K | 5.0/6 | 13.5 tok/s | 7.8 GB | Most reliable |
	| ✨ 3 | LFM2-1.2B-Tool Q8_0 (slim) | 4.5/6 | 76 tok/s | 2.75 GB | Efficiency king |
	| 4 | Gemma4 E4B Uncensored Q6_K_P | 4.5/6 | 23.1 tok/s | 6.7 GB | |
	| 5 | Qwen3.5-9B Base Q4_K_XL | 4.5/6 | 10.0 tok/s | 6.5 GB | |
	| 6 | Gemma4 E4B Uncensored Q8_K_P | 4.0/6 | 19.0 tok/s | 8.5 GB | Higher quant = worse! |
	| 7 | Qwen3.5-9B Uncensored Q4_K_M | 3.5/6 | 16.7 tok/s | 6.1 GB | |
	| 8 | Qwen3VL-8B Balanced Q6_K | 3.0/6 | 16.2 tok/s | 7.4 GB | |
	| 9 | Bonsai-8B 1-bit | 1.0/6 | 48.8 tok/s | 1.5 GB | 73% BFCL but 1/6 here |
	| 10 | LFM2-8B-A1B Q6_K (1.5B active) | 1.0/6 | 69.4 tok/s | 6.4 GB | Base model, no tool training |
	| 11 | LFM2.5-Nova 1.2B Q4 | 0.0/6 | 118 tok/s | 0.8 GB | 4K context too small |
	| 12 | FunctionGemma 270M Q8 | 0.0/6 | 197 tok/s | 0.3 GB | Infinite loop |
	| 13 | Qwopus-27B Q3_K_S | OOM | n/a | 14+ GB | Doesn't fit 16GB |

	## What This Benchmarks

	6 real-world browser agent tasks, not synthetic function-call formatting tests:

	| # | Task | Difficulty | What it tests |
	|---|------|-----------|---------------|
	| T1 | Wikipedia info extraction | Easy | Navigate → extract → report |
	| T2 | DuckDuckGo search | Medium | Navigate → type → click → read |
	| T3 | Hacker News top story | Easy | Navigate → read → stop |
	| T4 | Cat image detection (Falcon Perception) | Medium | Navigate → vision_detect → report |
	| T5 | Form filling (httpbin POST) | Medium | Navigate → input × 3 → click submit |
	| T6 | reCAPTCHA challenge | Hard | Navigate → click → vision → batch click |

	Each test requires multi-step tool chaining — not single-turn function call formatting.
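To make "multi-step tool chaining" concrete, here is a minimal sketch of how a recorded action trace could be checked against the chain listed for a task. It is not the benchmark's actual scoring code: the function name `follows_chain`, the action labels, and the traces are all hypothetical, and the check only asks whether the expected steps appear in order, tolerating extra actions in between.

```python
from typing import Iterable, Sequence


def follows_chain(actions: Iterable[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` occurs in `actions` as an ordered subsequence."""
    remaining = iter(actions)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step can only match a later action.
    return all(step in remaining for step in expected)


# Hypothetical chain for T5 (navigate, three inputs, click submit).
t5_expected = ["navigate", "input", "input", "input", "click"]

# Hypothetical traces: the first includes an extra screenshot, the second skips the form.
t5_trace = ["navigate", "screenshot", "input", "input", "input", "click"]
short_trace = ["navigate", "click"]

print(follows_chain(t5_trace, t5_expected))
print(follows_chain(short_trace, t5_expected))
```

A single-turn formatting test would only inspect one call. A check like this one fails a model that emits a well-formed first call and then stops or loops.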

	## 10 Counter-Intuitive Findings

	1. BFCL ≠ Agent Capability: Bonsai scores 73% on BFCL but 1.0/6 on real agent tasks
	2. Higher-Precision Quant ≠ Better for MoE: Gemma4 scores Q5 (5.0) > Q6 (4.5) > Q8 (4.0)
	3. Higher-Precision Quant = Better for Dense: Qwen scores Q4 (3.5) < Q6 (5.0)
	4. Uncensored ≠ Better Agent: score differences track quantization level, not whether the model is uncensored
	5. Faster Backend ≠ Better Results: GGUF at 24 tok/s outscores MLX at 35 tok/s, which is held back by proxy issues
	6. 197 tok/s Model Scores 0/6: FunctionGemma is the fastest model tested yet completes no task
	7. 4B MoE = 9B Dense: Gemma4 E4B matches Qwen3.5-9B on agent tasks (both 5.0/6)
	8. 1.2B Specialized > 8B Base: LFM2-1.2B-Tool (4.5/6) > LFM2-8B-A1B (1.0/6)
	9. The "Capability Cliff" Has Exceptions: LFM2-1.2B-Tool breaks the ~4B active-parameter minimum
	10. Small Models Are Context-Starved: reducing the tool list from 26 to 8 raised LFM2 from 4.0 to 4.5
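Finding 10 comes down to sending fewer tool schemas with each request. The sketch below shows one way to trim an OpenAI-style `tools` list to an allow-list and compare the serialized size before and after. The helper names (`select_tools`, `schema_chars`, `make_tool`) and the placeholder tool names are assumptions for illustration, and character count is only a rough stand-in for prompt tokens.

```python
import json


def select_tools(tools: list[dict], keep: set[str]) -> list[dict]:
    """Keep only the tool schemas whose function name is in `keep`."""
    return [tool for tool in tools if tool["function"]["name"] in keep]


def schema_chars(tools: list[dict]) -> int:
    """Serialized size of the schemas, a rough proxy for prompt tokens."""
    return len(json.dumps(tools))


def make_tool(name: str) -> dict:
    """Build a placeholder schema (hypothetical, not one of the benchmark's tools)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": f"{name} tool",
            "parameters": {"type": "object", "properties": {}},
        },
    }


all_tools = [make_tool(f"tool_{i}") for i in range(26)]
slim_tools = select_tools(all_tools, {f"tool_{i}" for i in range(8)})

print(len(all_tools), len(slim_tools))
print(schema_chars(slim_tools) < schema_chars(all_tools))
```

For a model with a small context window, every schema removed leaves more room for page content and conversation history.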

	## 5-Axis Analysis

	### Axis 1: Model Family
	- Minimum ~4B active params for multi-step agent tasks (with one exception)
	- MoE models (Gemma4 4B active) match dense models (Qwen 9B) at lower cost
	- Liquid Neural Network architecture (LFM2-1.2B-Tool) breaks the 4B rule with specialized training

	### Axis 2: Censoring
	- Uncensored models show no advantage on tool-calling agent tasks
	- Score differences track quantization level, not whether a model is uncensored

	### Axis 3: Quantization
	- MoE models: Q5 is the sweet spot (speed > precision)
	- Dense models: Q6 is the sweet spot (precision > speed)
	- Never go below Q4 or above Q8 for agent tasks

	### Axis 4: Backend
	- llama.cpp GGUF is the universal winner — native tool calling, no proxy
	- MLX is faster but needs a 7-fix proxy for LlmTornado compatibility
	- Ollama has API format issues with Gemma4

	### Axis 5: Vision
	- mmproj + Falcon Perception together score 5.0/6 (best)
	- Either alone scores 4.5/6
	- Falcon Perception (0.6B): 2s/detection, pixel-accurate coordinates

	## Hardware

	- Mac Mini M4 16GB (Apple Silicon)
	- macOS Darwin 24.3.0
	- llama.cpp b8640 (homebrew)
	- Falcon Perception v2 (MLX backend)
	- GUA_Blazor .NET 10 agent framework

	## Architecture

	```
	User Task → GUA_Blazor (agent loop, 25 turns)
	→ LLM (llama.cpp, port 8081) — reasoning + tool calling
	→ Falcon Perception (MLX, port 8090) — vision detection
	→ Playwright Chromium — browser automation
	```
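The loop in the diagram above can be sketched in a few lines. This is an illustrative Python version, not the GUA_Blazor implementation (which is .NET). `run_agent`, `call_llm`, and `execute_tool` are hypothetical names, and the message shapes assume the OpenAI-style chat format that the curl example in this README uses. The LLM and tool executor are passed in as callables so the loop can be exercised without a running server.

```python
import json
from typing import Callable, Optional

MAX_TURNS = 25  # turn budget taken from the architecture diagram


def run_agent(
    task: str,
    call_llm: Callable[[list], dict],
    execute_tool: Callable[[str, dict], str],
    max_turns: int = MAX_TURNS,
) -> Optional[str]:
    """Alternate between the model and its tools until it answers in plain text."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_llm(messages)  # one assistant message dict
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content")  # no tool requested: final answer
        for call in tool_calls:
            args = json.loads(call["function"]["arguments"])
            result = execute_tool(call["function"]["name"], args)
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    return None  # budget exhausted, e.g. a model stuck repeating one call


def fake_llm(messages: list) -> dict:
    """Stand-in model: request one navigation, then answer."""
    if messages[-1]["role"] == "user":
        arguments = json.dumps({"action": "navigate", "url": "https://example.com"})
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_0",
                    "type": "function",
                    "function": {"name": "browser_use", "arguments": arguments},
                }
            ],
        }
    return {"role": "assistant", "content": "done"}


print(run_agent("Open example.com", fake_llm, lambda name, args: "ok"))
```

The hard turn limit is what turns an infinite tool-call loop into a scored failure instead of a hung run.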

	## Run It Yourself

	### Prerequisites
	```bash
	# macOS with Apple Silicon
	brew install llama.cpp
	pip install falcon-perception # or clone github.com/tiiuae/falcon-perception
	```

	### Quick Speed Test
	```bash
	# Download a model
	huggingface-cli download HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive \
	Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf --local-dir ./models

	# Start server
	llama-server -m ./models/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf \
	--port 8081 -ngl 99 -c 16384

	# Test tool calling
	curl -s http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{
	"model": "test",
	"stream": false,
	"messages": [{"role": "user", "content": "Navigate to google.com"}],
	"tools": [{"type": "function", "function": {"name": "browser_use", "description": "Browser", "parameters": {"type": "object", "properties": {"action": {"type": "string"}, "url": {"type": "string"}}, "required": ["action"]}}}]
	}'
	```
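The same request can be sent from Python with only the standard library. The sketch below assumes the server returns an OpenAI-style response with `choices[0].message.tool_calls` and a `usage.completion_tokens` count. The helper names and the canned `sample` response are illustrative, not captured server output. The throughput figure is a wall-clock estimate that includes prompt processing, so it will read lower than pure generation speed.

```python
import json
import time
import urllib.request

URL = "http://localhost:8081/v1/chat/completions"  # port from the llama-server command above


def post_chat(payload: dict, url: str = URL) -> tuple[dict, float]:
    """POST a chat request and return (parsed response, elapsed seconds)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body, time.perf_counter() - start


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from the first choice, or [] if none."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


def tokens_per_second(response: dict, elapsed_s: float) -> float:
    """Wall-clock throughput estimate from the reported completion tokens."""
    return response["usage"]["completion_tokens"] / elapsed_s


# Canned response for illustration only, shaped like an OpenAI-style reply.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "browser_use",
                            "arguments": json.dumps(
                                {"action": "navigate", "url": "https://google.com"}
                            ),
                        },
                    }
                ],
            }
        }
    ],
    "usage": {"completion_tokens": 49},
}

print(extract_tool_calls(sample))
print(tokens_per_second(sample, 2.0))
```

With the server running, `response, elapsed = post_chat(payload)` followed by the two helpers gives a quick pass/fail on whether the model emits a structured tool call at all, plus a rough speed reading.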

	### Run Benchmark
	```bash
	python bench/run_benchmark.py --model ./models/your-model.gguf --mmproj ./models/mmproj.gguf
	```

	## Models Tested

	| Model | HuggingFace | Backend | mmproj? |
	|-------|-------------|---------|---------|
	| Gemma4 E4B Uncensored | [HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
	| Qwen3.5-9B Uncensored | [HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive](https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive) | llama.cpp | Yes (in repo) |
	| Qwen3.5-9B Base | [unsloth/Qwen3.5-9B-GGUF](https://huggingface.co/unsloth/Qwen3.5-9B-GGUF) | llama.cpp | Yes |
	| LFM2-1.2B-Tool | [LiquidAI/LFM2-1.2B-Tool-GGUF](https://huggingface.co/LiquidAI/LFM2-1.2B-Tool-GGUF) | llama.cpp | No (text only) |
	| Bonsai-8B | [prism-ml/Bonsai-8B-gguf](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | PrismML fork | No |
	| Gemma4 E4B Base (MLX) | [mlx-community/gemma-4-e4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-e4b-it-4bit) | mlx_vlm | Native |
	| LFM2-8B-A1B | [LiquidAI/LFM2-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2-8B-A1B-GGUF) | llama.cpp | No |
	| FunctionGemma 270M | [unsloth/functiongemma-270m-it-GGUF](https://huggingface.co/unsloth/functiongemma-270m-it-GGUF) | llama.cpp | No |

	## Key Files

	```
	bench/
	  run_benchmark.py           # Main benchmark runner
	  tasks.json                 # 6 test task definitions
	  results/                   # Raw results from all runs
	reports/
	  FINAL_Report.md            # Complete 5-axis analysis
	  Multi_Axis_Analysis.md     # Detailed breakdown per axis
	  Model_Comparison.md        # Side-by-side tables
	proxies/
	  gemma4_proxy.py            # Gemma4 MLX → LlmTornado proxy (7 fixes)
	  lfm2_proxy.py              # LFM2 pythonic tool-call proxy
	vision/
	  falcon_vision_server.py    # Falcon Perception 3-layer adaptive pipeline
	```
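For readers wondering what a "pythonic tool-call proxy" has to do, the core step is turning a Python-style call emitted as text into a structured OpenAI-style `tool_calls` entry. The sketch below is not the code in `lfm2_proxy.py`. It assumes the model wraps calls in `<|tool_call_start|>` and `<|tool_call_end|>` markers around a list such as `[browser_use(action="navigate")]`, and the function name `parse_pythonic_calls` is hypothetical. It uses `ast` so arguments are parsed as literals and never executed.

```python
import ast
import json
import re

# Assumed marker format; adjust to whatever the model actually emits.
CALL_BLOCK = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL)


def parse_pythonic_calls(text: str) -> list[dict]:
    """Convert pythonic call text into OpenAI-style tool_calls entries."""
    tool_calls = []
    for block in CALL_BLOCK.findall(text):
        try:
            tree = ast.parse(block.strip(), mode="eval")
        except SyntaxError:
            continue  # malformed output: skip the block instead of crashing
        body = tree.body
        nodes = body.elts if isinstance(body, ast.List) else [body]
        for node in nodes:
            if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
                continue  # only plain calls like name(key=value) are accepted
            try:
                # literal_eval accepts only literals, so no model output is executed.
                args = {
                    kw.arg: ast.literal_eval(kw.value)
                    for kw in node.keywords
                    if kw.arg is not None
                }
            except ValueError:
                continue  # non-literal argument: drop this call
            tool_calls.append(
                {
                    "id": f"call_{len(tool_calls)}",
                    "type": "function",
                    "function": {"name": node.func.id, "arguments": json.dumps(args)},
                }
            )
    return tool_calls


raw = (
    "Sure. <|tool_call_start|>"
    '[browser_use(action="navigate", url="https://example.com")]'
    "<|tool_call_end|>"
)
for call in parse_pythonic_calls(raw):
    print(call["function"]["name"], call["function"]["arguments"])
```

A proxy built around a parser like this can sit between the model server and the agent framework, so the framework sees standard `tool_calls` regardless of how the model formats them.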

	## Citation

	If you use this benchmark, please cite:
	```
	@misc{small-llm-agent-bench-2026,
	title={Small LLM Agent Benchmark: Real-World Browser Agent Tasks on 16GB Apple Silicon},
	author={Xavier},
	year={2026},
	url={https://huggingface.co/Manojb/CUA_benchmark_local_small_models}
	}
	```

	## License

	MIT