Provider Comparison: API vs Self-Hosted
Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files. All providers use hybrid retrieval (FAISS + BM25 + RRF), cross-encoder reranking, a grounded-refusal threshold, and an identical system prompt.
Note: The self-hosted config differs from the API configs in two ways to accommodate
the 7B deployment's smaller configured context window (8192 tokens) and weaker
instruction following: max_iterations=1 (vs 3) and top_k=3 (vs 5). This means the
self-hosted row is not a controlled comparison. It reflects realistic operating
constraints for a 7B model, not an apples-to-apples provider swap. The API providers
are directly comparable to each other.
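The RRF step in the hybrid retrieval setup can be sketched as follows. This is a minimal illustration of reciprocal rank fusion, not the project's actual implementation: the function name, the constant k=60 (a commonly used default), and the example file names are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value taken
    from the benchmark config.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists standing in for FAISS (dense) and BM25 (sparse).
dense_hits = ["routing.md", "dependencies.md", "security.md"]
sparse_hits = ["dependencies.md", "testing.md", "routing.md"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why RRF is a common way to combine dense and sparse retrievers without calibrating their raw scores against each other.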
Results
| Provider | Model | Iterations | top_k | P@5 | R@5 | Citation Acc | Latency p50 (ms) | Cost/query |
|---|---|---|---|---|---|---|---|---|
| OpenAI (API) | gpt-4o-mini | 3 | 5 | 0.70 | 0.83 | 1.00 | 4,690 | $0.0004 |
| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
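For readers unfamiliar with the P@5 and R@5 columns, the sketch below shows one common way to compute precision and recall at k. It assumes the textbook definitions with a fixed denominator of k for precision; the evaluation harness may define them differently (for example, dividing by the number of results actually returned). The file names are made up.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical retrieval result and gold labels for one question.
retrieved = ["a.md", "b.md", "c.md", "d.md", "e.md"]
relevant = {"a.md", "c.md", "f.md"}

print(precision_at_k(retrieved, relevant, 5))
print(recall_at_k(retrieved, relevant, 5))
```

Under this particular definition, a configuration that returns only 3 results can reach at most 0.6 on P@5, which is worth keeping in mind when reading the self-hosted row (top_k=3).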
Why Mistral-7B Scores So Differently
The gap between API providers and self-hosted Mistral-7B compounds from three factors, ordered by causal priority:
1. No native tool calling (upstream of everything else). vLLM 0.6.6 with Mistral-7B
doesn't support OpenAI-format tool_calls. The provider falls back to injecting tool
descriptions into the system prompt and parsing JSON from the model's text output.
Mistral-7B frequently produces malformed JSON or calls tools with vague queries like
"search" instead of "FastAPI dependency injection", so the retrieval stage receives
unusable queries and P@5 collapses to 0.05.
2. Forced single iteration (context window constraint). API providers get 3 iterations (call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration adds ~2K tokens of tool results, and the 8K context window fills up. That leaves one shot at picking the right tool and query, with no opportunity to refine.
3. Weak instruction following (residual even when 1 and 2 don't bite). Even when
Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation
format ([source: filename.md]) specified in the system prompt. Citation accuracy is 0.14:
the model is not hallucinating sources, it is mostly omitting them.
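The prompt-based fallback described in factor 1 depends on pulling a JSON tool call out of free-form model text. A minimal sketch of such a parser is shown below. The call schema ({"tool": ..., "arguments": {"query": ...}}), the tool name search_docs, and the rule that rejects one-word queries are all assumptions made for illustration, not the project's actual format.

```python
import json
import re


def parse_tool_call(text, allowed_tools):
    """Return a validated tool call dict parsed from model text, or None.

    The schema and the one-word-query heuristic are hypothetical.
    """
    # Greedy match from the first "{" to the last "}" in the output.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed JSON: the failure mode described above
    if not isinstance(call, dict):
        return None
    name = call.get("tool")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in allowed_tools:
        return None
    if not isinstance(args, dict):
        return None
    query = args.get("query")
    # Reject vague queries such as "search" (assumed heuristic).
    if not isinstance(query, str) or len(query.split()) < 2:
        return None
    return {"tool": name, "arguments": args}


tools = {"search_docs"}  # hypothetical tool name
good = 'Sure. {"tool": "search_docs", "arguments": {"query": "FastAPI dependency injection"}}'
truncated = '{"tool": "search_docs", "arguments": {"query": "FastAPI routing"}'
vague = '{"tool": "search_docs", "arguments": {"query": "search"}}'

print(parse_tool_call(good, tools))
print(parse_tool_call(truncated, tools))
print(parse_tool_call(vague, tools))
```

Returning None for anything that fails validation lets the caller decide whether to re-prompt or give up. With max_iterations=1 there is no budget for a retry, which is part of why parse failures are so costly in the self-hosted configuration.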
Keyword hit rate (0.61) is the interesting signal. The model sometimes generates queries with relevant keywords, meaning it has semantic understanding of the questions but can't translate that into well-formed tool calls. This is exactly the gap between "understands language" and "can operate as an agent."
Cost is counterintuitively higher. Self-hosted Mistral-7B costs $0.0031/query vs $0.0004 for OpenAI gpt-4o-mini, nearly 8x more despite being self-hosted. Modal A10G time is billed per GPU-second, and Mistral-7B takes longer per query while producing worse results. The cost advantage of self-hosting only materializes at high throughput with batched requests and sustained GPU utilization, not at single-query evaluation scale.
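A back-of-the-envelope sketch of the throughput argument, using only the per-query costs reported in the results table. It assumes the GPU bill for a given stretch of time is split evenly across the queries served concurrently, and that per-query latency does not grow under batching. The second assumption is optimistic, so the result is best read as a lower bound on the concurrency needed.

```python
import math

SELF_HOSTED_COST = 0.0031  # USD per query, from the results table
OPENAI_COST = 0.0004       # USD per query, from the results table


def cost_with_batching(single_query_cost, concurrent_queries):
    """Per-query cost when the same GPU time serves several queries at once."""
    if concurrent_queries < 1:
        raise ValueError("concurrent_queries must be at least 1")
    return single_query_cost / concurrent_queries


def break_even_concurrency(single_query_cost, target_cost):
    """Smallest whole number of concurrent queries that reaches target_cost."""
    return math.ceil(single_query_cost / target_cost)


needed = break_even_concurrency(SELF_HOSTED_COST, OPENAI_COST)
print(needed)
print(cost_with_batching(SELF_HOSTED_COST, needed))
```

Under these assumptions the self-hosted path needs on the order of eight queries in flight at all times just to match gpt-4o-mini on cost, before accounting for cold starts or idle GPU time.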
Infrastructure
| Config | Cold start | Warm latency p50 | GPU | Infra |
|---|---|---|---|---|
| OpenAI | N/A | 4,690 ms | N/A | Managed API |
| Anthropic | N/A | 5,120 ms | N/A | Managed API |
| Self-hosted (Modal) | ~90s | 6,709 ms | A10G (24GB) | Serverless GPU |
How to Reproduce
```bash
# OpenAI evaluation
OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic

# Anthropic evaluation
ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic

# Self-hosted evaluation (requires Modal deployment + HF secret)
pip install -e ".[modal]"
modal secret create huggingface-secret HF_TOKEN=hf_...
modal deploy modal/serve_vllm.py
export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
python scripts/evaluate.py --config configs/selfhosted_modal.yaml --mode deterministic

# All providers at once
make benchmark-all
```
Known Limitations & Future Work
The self-hosted benchmark is not a controlled comparison. Three factors explain the gap; the first two are fixable configuration constraints rather than inherent model limits:
1. Prompt-based tool calling (fixable). vLLM 0.6.6 was pinned to work around huggingface_hub and transformers dependency conflicts. Newer vLLM versions (0.8+) support native Mistral tool calling via --enable-auto-tool-choice --tool-call-parser mistral. This would eliminate the malformed-JSON failure mode that drives P@5 to 0.05.
2. Artificially low context window (fixable). Mistral-7B supports 32K context natively, but max_model_len is set to 8K to fit A10G memory at gpu_memory_utilization=0.85. Bumping to 16K (with 0.90 utilization) would likely allow restoring max_iterations=3 and top_k=5 to match the API configs, making the comparison controlled.
3. Model scale (architectural). Even with fixes 1 and 2, a 7B model will likely underperform gpt-4o-mini and claude-haiku on multi-step agentic tasks. A fairer model-size comparison would use Mixtral-8x7B or Llama-3-70B (requiring A100 80GB). This would refine the model-size floor estimate but not change the architectural conclusion.
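The context-window argument can be checked with rough token arithmetic. The sketch below takes the ~2K tokens per iteration figure from the analysis above and adds several labeled assumptions: that the 2K figure was observed at top_k=3, that tool-result size scales linearly with top_k, and that the fixed prompt and answer reservation are 1,500 and 1,024 tokens. None of these values were measured from the actual deployment.

```python
FIXED_PROMPT_TOKENS = 1500     # assumed: system prompt, tool descriptions, question
ANSWER_TOKENS = 1024           # assumed: room reserved for the final answer
TOKENS_PER_ITER_AT_TOP3 = 2000 # ~2K per iteration, assumed to apply at top_k=3


def tokens_per_iteration(top_k):
    """Assume tool-result size scales linearly with the number of chunks."""
    return TOKENS_PER_ITER_AT_TOP3 * top_k // 3


def required_tokens(iterations, top_k):
    """Total context needed when every iteration's results stay in the prompt."""
    return FIXED_PROMPT_TOKENS + iterations * tokens_per_iteration(top_k) + ANSWER_TOKENS


def fits(context_window, iterations, top_k):
    return required_tokens(iterations, top_k) <= context_window


for window, iterations, top_k in [(8192, 1, 3), (8192, 3, 5), (16384, 3, 5)]:
    print(window, iterations, top_k, required_tokens(iterations, top_k),
          fits(window, iterations, top_k))
```

With these numbers the API-style settings (3 iterations, top_k=5) overflow an 8K window but fit comfortably in 16K, which is consistent with the expectation that raising max_model_len would allow a controlled comparison. The real budget depends on the tokenizer and prompt templates, so treat this as an estimate only.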
Takeaway
The provider abstraction works as designed: switching providers is a single config change. API models dominate on quality metrics, but the self-hosted path demonstrates end-to-end inference serving: vLLM on Modal (serverless A10G), OpenAI-compatible endpoint, identical evaluation harness. The quality gap is a combination of model scale and infrastructure constraints, both of which are documented and addressable.
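To illustrate what "a single config change" can look like in practice, here is a minimal sketch of a provider registry keyed by a config value. Every name in it (Provider, EchoProvider, REGISTRY, build_provider, the "provider" config key) is hypothetical, and the stub provider only echoes its input. The project's real interfaces may be shaped quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class Provider(Protocol):
    """Hypothetical interface that every backend implements."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoProvider:
    """Stand-in backend that tags the prompt instead of calling a model."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Maps the config value to a factory; a real registry would build API clients.
REGISTRY: Dict[str, Callable[[], Provider]] = {
    "openai": lambda: EchoProvider("openai"),
    "anthropic": lambda: EchoProvider("anthropic"),
    "selfhosted_modal": lambda: EchoProvider("selfhosted_modal"),
}


def build_provider(config: dict) -> Provider:
    """Select a backend from the (assumed) "provider" key of a config dict."""
    key = config["provider"]
    if key not in REGISTRY:
        raise ValueError(f"unknown provider: {key!r}")
    return REGISTRY[key]()


provider = build_provider({"provider": "anthropic"})
print(provider.complete("What is dependency injection?"))
```

Because the evaluation code only depends on the shared interface, swapping the config value is enough to rerun the same harness against a different backend.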
Generated by modal/run_benchmark.py