agentbench / docs /provider_comparison.md
Nomearod's picture
docs: add known limitations and future work for self-hosted benchmark
79e4ae8

Provider Comparison: API vs Self-Hosted

Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files. All providers use hybrid retrieval (FAISS + BM25 + RRF), cross-encoder reranking, grounded refusal threshold, and identical system prompt.

Note: The self-hosted config differs from API configs in two ways to accommodate the 7B model's smaller context window (8192 tokens) and weaker instruction following: max_iterations=1 (vs 3) and top_k=3 (vs 5). This means the self-hosted row is not a controlled comparison β€” it reflects realistic operating constraints for a 7B model, not an apples-to-apples provider swap. The API providers are directly comparable to each other.

Results

Provider Model Iterations top_k P@5 R@5 Citation Acc Latency p50 (ms) Cost/query
OpenAI (API) gpt-4o-mini 3 5 0.70 0.83 1.00 4,690 $0.0004
Anthropic (API) claude-haiku-4-5 3 5 0.74 0.84 1.00 5,120 $0.0007
Self-hosted (Modal) Mistral-7B-Instruct-v0.3 1 3 0.05 0.05 0.14 6,709 $0.0031

Why Mistral-7B Scores So Differently

The gap between API providers and self-hosted Mistral-7B compounds from three factors, ordered by causal priority:

1. No native tool calling (upstream of everything else). vLLM 0.6.6 with Mistral-7B doesn't support OpenAI-format tool_calls. The provider falls back to injecting tool descriptions into the system prompt and parsing JSON from the model's text output. Mistral-7B frequently produces malformed JSON or calls tools with vague queries like "search" instead of "FastAPI dependency injection" β€” so the retrieval stage gets garbage queries, and P@5 collapses to 0.05.

2. Forced single iteration (context window constraint). API providers get 3 iterations (call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration adds ~2K tokens of tool results, and the 8K context window fills up. One shot at picking the right tool and query, no opportunity to refine.

3. Weak instruction following (residual even when 1 and 2 don't bite). Even when Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation format ([source: filename.md]) specified in the system prompt. Citation accuracy is 0.14 β€” it's not hallucinating sources, it's mostly omitting them.

Keyword hit rate (0.61) is the interesting signal. The model sometimes generates queries with relevant keywords, meaning it has semantic understanding of the questions but can't translate that into well-formed tool calls. This is exactly the gap between "understands language" and "can operate as an agent."

Cost is counterintuitively higher. Self-hosted Mistral-7B costs $0.0031/query vs $0.0004 for OpenAI gpt-4o-mini β€” despite being self-hosted. Modal A10G time is billed per GPU-second, and Mistral-7B takes longer per query while producing worse results. The cost advantage of self-hosted only materializes at high throughput with batched requests and sustained GPU utilization, not at single-query evaluation scale.

Infrastructure

Config Cold start Warm latency p50 GPU Infra
OpenAI N/A 4,690 ms N/A Managed API
Anthropic N/A 5,120 ms N/A Managed API
Self-hosted (Modal) ~90s 6,709 ms A10G (24GB) Serverless GPU

How to Reproduce

# OpenAI evaluation
OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic

# Anthropic evaluation
ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic

# Self-hosted evaluation (requires Modal deployment + HF secret)
pip install -e ".[modal]"
modal secret create huggingface-secret HF_TOKEN=hf_...
modal deploy modal/serve_vllm.py
export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
python scripts/evaluate.py --config configs/selfhosted_modal.yaml --mode deterministic

# All providers at once
make benchmark-all

Known Limitations & Future Work

The self-hosted benchmark is not a controlled comparison. Three specific constraints disadvantage Mistral-7B beyond its inherent model quality:

  1. Prompt-based tool calling (fixable). vLLM 0.6.6 was pinned to work around huggingface_hub and transformers dependency conflicts. Newer vLLM versions (0.8+) support native Mistral tool calling via --enable-auto-tool-choice --tool-call-parser mistral. This would eliminate the malformed-JSON failure mode that drives P@5 to 0.05.

  2. Artificially low context window (fixable). Mistral-7B supports 32K context natively, but max_model_len is set to 8K to fit A10G memory at gpu_memory_utilization=0.85. Bumping to 16K (with 0.90 utilization) would likely allow restoring max_iterations=3 and top_k=5 to match the API configs β€” making the comparison controlled.

  3. Model scale (architectural). Even with fixes 1 and 2, a 7B model will underperform gpt-4o-mini and claude-haiku on multi-step agentic tasks. A fairer model-size comparison would use Mixtral-8x7B or Llama-3-70B (requiring A100 80GB). This would refine the model-size floor estimate but not change the architectural conclusion.

Takeaway

The provider abstraction works as designed β€” switching providers is a single config change. API models dominate on quality metrics, but the self-hosted path demonstrates end-to-end inference serving: vLLM on Modal (serverless A10G), OpenAI-compatible endpoint, identical evaluation harness. The quality gap is a combination of model scale and infrastructure constraints, both of which are documented and addressable.


Generated by modal/run_benchmark.py