	# Provider Comparison: API vs Self-Hosted

	Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files.
	All providers use hybrid retrieval (FAISS + BM25 + RRF), cross-encoder reranking,
	a grounded-refusal threshold, and an identical system prompt.
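
For readers unfamiliar with the RRF step, the following is a minimal sketch of reciprocal rank fusion over two ranked lists, one from a dense index and one from BM25. It is not the harness's implementation: the function name, the constant `k=60` (a commonly used default), and the file names are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from the benchmark config.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the dense (FAISS) and sparse (BM25) retrievers.
dense = ["security.md", "dependencies.md", "middleware.md"]
sparse = ["dependencies.md", "testing.md", "security.md"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why fusion tends to be more robust than either retriever alone.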

	Note: The self-hosted config differs from the API configs in two ways, to accommodate
	the 8192-token context window configured for the 7B model and its weaker instruction
	following: `max_iterations=1` (vs 3) and `top_k=3` (vs 5). The self-hosted row is
	therefore not a controlled comparison. It reflects realistic operating constraints for a
	7B model, not an apples-to-apples provider swap. The API providers are directly
	comparable to each other.

	## Results

	| Provider | Model | Iterations | top_k | P@5 | R@5 | Citation Acc | Latency p50 (ms) | Cost/query |
	|----------|-------|-----------|-------|-----|-----|--------------|-------------------|------------|
	| OpenAI (API) | gpt-4o-mini | 3 | 5 | 0.70 | 0.83 | 1.00 | 4,690 | $0.0004 |
	| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
	| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
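
As a reading aid for the P@5 and R@5 columns, here is a sketch of the textbook definitions of precision and recall at k. The evaluation harness may compute them differently (for example, dividing by the number of returned documents rather than by k), so treat the function names and the exact denominators as assumptions.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k slots filled by relevant documents (denominator is k)."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical example: 3 documents retrieved, 2 documents labeled relevant.
retrieved = ["a.md", "b.md", "c.md"]
relevant = {"a.md", "d.md"}
p5 = precision_at_k(retrieved, relevant)
r5 = recall_at_k(retrieved, relevant)
print(p5, r5)
```

Under this definition, a run that returns only 3 documents can score at most 3/5 = 0.6 on P@5, which is one more reason the `top_k=3` row is not directly comparable to the `top_k=5` rows.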

	## Why Mistral-7B Scores So Differently

	The gap between API providers and self-hosted Mistral-7B compounds from three factors,
	ordered by causal priority:

	1. No native tool calling (upstream of everything else). vLLM 0.6.6 with Mistral-7B
	doesn't support OpenAI-format `tool_calls`. The provider falls back to injecting tool
	descriptions into the system prompt and parsing JSON from the model's text output.
	Mistral-7B frequently produces malformed JSON or calls tools with vague queries like
	`"search"` instead of `"FastAPI dependency injection"` — so the retrieval stage gets
	garbage queries, and P@5 collapses to 0.05.

	2. Forced single iteration (context window constraint). API providers get 3 iterations
	(call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration
	adds ~2K tokens of tool results, which would quickly fill the 8K context window. The model
	gets one shot at picking the right tool and query, with no opportunity to refine.

	3. Weak instruction following (residual even when 1 and 2 don't bite). Even when
	Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation
	format (`[source: filename.md]`) specified in the system prompt. Citation accuracy is 0.14 —
	it's not hallucinating sources, it's mostly omitting them.
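
The prompt-based fallback in factor 1 can be sketched roughly as follows. This is an illustration of the general technique, not the provider's actual parser: the JSON shape (`tool` and `arguments` keys), the tool name `search_docs`, and the function name are all hypothetical.

```python
import json
import re


def parse_tool_call(text, allowed_tools):
    """Extract a tool call from free-form model output.

    Returns (tool_name, arguments), or None when no well-formed call is found.
    The {"tool": ..., "arguments": {...}} schema is an assumption for this sketch.
    """
    # Grab the span from the first "{" to the last "}" in the output.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed JSON: the failure mode described above
    if not isinstance(payload, dict):
        return None
    name = payload.get("tool")
    args = payload.get("arguments")
    if name not in allowed_tools or not isinstance(args, dict):
        return None
    return name, args


good = 'Sure. {"tool": "search_docs", "arguments": {"query": "FastAPI dependency injection"}}'
bad = '{"tool": "search_docs", "arguments": {"query": "search"'  # truncated output
print(parse_tool_call(good, {"search_docs"}))
print(parse_tool_call(bad, {"search_docs"}))
```

Note that a parser like this only catches syntactic failures. A syntactically valid call with a vague query such as `"search"` still passes, so the retrieval stage has no way to recover from it within a single iteration.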

	Keyword hit rate (0.61, not shown in the results table) is the interesting signal. The
	model sometimes generates queries with relevant keywords, meaning it has semantic
	understanding of the questions but can't translate that into well-formed tool calls.
	This is exactly the gap between "understands language" and "can operate as an agent."
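
The source does not define how keyword hit rate is computed. One plausible reading, sketched here purely as an assumption, is the fraction of a question's expected keywords that appear (case-insensitively) in the model's text, averaged over the dataset. The function name and the sample keywords are hypothetical.

```python
def keyword_hit_rate(text, expected_keywords):
    """Fraction of expected keywords found in the text, ignoring case.

    This is an assumed definition for illustration; the harness may differ.
    """
    if not expected_keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in lowered)
    return hits / len(expected_keywords)


# Hypothetical model output and per-question keyword list.
output = "How does FastAPI handle dependency injection?"
rate = keyword_hit_rate(output, ["dependency", "Depends", "injection"])
print(rate)
```

A metric of this kind rewards topical vocabulary without checking structure, which is why it can stay moderately high while P@5 and citation accuracy collapse.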

	Cost is counterintuitively higher. Self-hosted Mistral-7B costs $0.0031/query vs
	$0.0004 for OpenAI gpt-4o-mini, roughly 8x more. Modal bills A10G time per GPU-second,
	and Mistral-7B takes longer per query while producing worse results.
	The cost advantage of self-hosting only materializes at high throughput with batched
	requests and sustained GPU utilization, not at single-query evaluation scale.
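
The arithmetic behind the cost claim can be made explicit. The sketch below uses only the figures from the results table and assumes, as a simplification, that each query's cost is the GPU time for one p50-latency request with no sharing. Real billing may include idle or cold-start time, so the implied rate is an accounting identity, not a quoted price.

```python
def implied_gpu_rate(cost_per_query, seconds_per_query):
    """Dollars per GPU-second implied by a per-query cost and latency."""
    return cost_per_query / seconds_per_query


def break_even_concurrency(self_hosted_cost, api_cost):
    """Queries that must share the same GPU-seconds to match the API price.

    Assumes batching adds no latency, which is optimistic.
    """
    return self_hosted_cost / api_cost


rate = implied_gpu_rate(0.0031, 6.709)        # figures from the results table
hourly = rate * 3600                          # implied dollars per GPU-hour
concurrency = break_even_concurrency(0.0031, 0.0004)
print(f"implied rate: ${hourly:.2f}/GPU-hour, break-even concurrency: {concurrency:.2f}")
```

Under these assumptions, roughly eight queries would need to be served concurrently on the same GPU before the per-query cost matches gpt-4o-mini, and that is before accounting for the quality gap.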

	## Infrastructure

	| Config | Cold start | Warm latency p50 | GPU | Infra |
	|--------|-----------|-------------------|-----|-------|
	| OpenAI | N/A | 4,690 ms | N/A | Managed API |
	| Anthropic | N/A | 5,120 ms | N/A | Managed API |
	| Self-hosted (Modal) | ~90s | 6,709 ms | A10G (24GB) | Serverless GPU |

	## How to Reproduce

	```bash
	# OpenAI evaluation
	OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic

	# Anthropic evaluation
	ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic

	# Self-hosted evaluation (requires Modal deployment + HF secret)
	pip install -e ".[modal]"
	modal secret create huggingface-secret HF_TOKEN=hf_...
	modal deploy modal/serve_vllm.py
	export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
	python scripts/evaluate.py --config configs/selfhosted_modal.yaml --mode deterministic

	# All providers at once
	make benchmark-all
	```

	## Known Limitations & Future Work

	The self-hosted benchmark is not a controlled comparison. Two fixable constraints
	disadvantage Mistral-7B beyond its inherent model quality, and a third reflects model scale itself:

	1. Prompt-based tool calling (fixable). vLLM 0.6.6 was pinned to work around
	`huggingface_hub` and `transformers` dependency conflicts. Newer vLLM versions (0.8+)
	support native Mistral tool calling via `--enable-auto-tool-choice --tool-call-parser mistral`.
	This would eliminate the malformed-JSON failure mode that drives P@5 to 0.05.

	2. Artificially low context window (fixable). Mistral-7B supports 32K context natively,
	but `max_model_len` is set to 8K to fit A10G memory at `gpu_memory_utilization=0.85`.
	Bumping to 16K (with `0.90` utilization) would likely allow restoring `max_iterations=3`
	and `top_k=5` to match the API configs — making the comparison controlled.

	3. Model scale (architectural). Even with fixes 1 and 2, a 7B model will underperform
	gpt-4o-mini and claude-haiku on multi-step agentic tasks. A fairer model-size comparison
	would use Mixtral-8x7B or Llama-3-70B (requiring A100 80GB). This would refine the
	model-size floor estimate but not change the architectural conclusion.
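
To put limitation 2 in perspective, the KV-cache cost of a longer context can be estimated with a back-of-the-envelope calculation. The sketch assumes the published Mistral-7B architecture (32 layers, 8 key/value heads via grouped-query attention, head dimension 128) and a 16-bit cache. vLLM preallocates cache blocks for multiple sequences, so actual memory use will differ.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV-cache size for one sequence of the given length.

    The leading factor of 2 covers one key and one value tensor per layer.
    Defaults are assumed from the published Mistral-7B config, fp16/bf16 cache.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


kv_8k = kv_cache_bytes(8192) / GIB     # current max_model_len
kv_16k = kv_cache_bytes(16384) / GIB   # proposed max_model_len
print(f"8K: {kv_8k:.1f} GiB, 16K: {kv_16k:.1f} GiB per full-length sequence")
```

Under these assumptions, doubling the context adds about 1 GiB per full-length sequence, which is consistent with the estimate that 16K is likely feasible on a 24GB A10G with a modest increase in `gpu_memory_utilization`.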

	## Takeaway

	The provider abstraction works as designed — switching providers is a single config change.
	API models dominate on quality metrics, but the self-hosted path demonstrates end-to-end
	inference serving: vLLM on Modal (serverless A10G), OpenAI-compatible endpoint, identical
	evaluation harness. The quality gap is a combination of model scale and infrastructure
	constraints, both of which are documented and addressable.

	---

	Generated by `modal/run_benchmark.py`