# Provider Comparison: API vs Self-Hosted

Evaluated on the same 27-question golden dataset over 16 FastAPI documentation files.
All providers use hybrid retrieval (FAISS + BM25 + RRF), cross-encoder reranking,
a grounded refusal threshold, and an identical system prompt.

**Note:** The self-hosted config differs from the API configs in two ways to accommodate
the 7B model's smaller configured context window (8,192 tokens) and weaker instruction following:
`max_iterations=1` (vs 3) and `top_k=3` (vs 5). This means the self-hosted row is
**not a controlled comparison**: it reflects realistic operating constraints for a
7B model, not an apples-to-apples provider swap. The API providers are directly
comparable to each other.
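
For readers unfamiliar with the RRF step in the retrieval stack, here is a minimal sketch of reciprocal rank fusion over a dense (FAISS-style) and a sparse (BM25-style) ranking. It is not taken from this project's code: the function name, the constant `k=60` (a commonly used default), and the file names are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists; the file names are illustrative only.
dense_hits = ["dependencies.md", "security.md", "middleware.md"]
bm25_hits = ["security.md", "testing.md", "dependencies.md"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents seen by only one retriever still survive into the candidate set handed to the reranker.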

## Results

| Provider | Model | Iterations | top_k | P@5 | R@5 | Citation Acc | Latency p50 (ms) | Cost/query |
|----------|-------|------------|-------|-----|-----|--------------|------------------|------------|
| OpenAI (API) | gpt-4o-mini | 3 | 5 | 0.70 | 0.83 | 1.00 | 4,690 | $0.0004 |
| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
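
The P@5 and R@5 columns are precision and recall over the top five retrieved documents. The sketch below shows one common way to compute them for a single question. It assumes precision divides by `k` even when fewer than `k` results come back, which may differ from what `scripts/evaluate.py` does; the function names and file names are illustrative.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k slots filled by relevant documents.

    Assumption: the denominator is k, even if fewer than k documents
    were retrieved.
    """
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


# Hypothetical single-question example.
retrieved = ["a.md", "b.md", "c.md", "d.md", "e.md"]
relevant = {"a.md", "c.md", "z.md"}

p5 = precision_at_k(retrieved, relevant)  # 2 of 5 slots are relevant
r5 = recall_at_k(retrieved, relevant)     # 2 of 3 relevant docs were found
```

Under this convention a provider running with `top_k=3` can never exceed a P@5 of 0.6, which is worth keeping in mind when reading the self-hosted row.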

## Why Mistral-7B Scores So Differently

The gap between API providers and self-hosted Mistral-7B compounds from three factors,
ordered by causal priority:

**1. No native tool calling (upstream of everything else).** vLLM 0.6.6 with Mistral-7B
doesn't support OpenAI-format `tool_calls`. The provider falls back to injecting tool
descriptions into the system prompt and parsing JSON from the model's text output.
Mistral-7B frequently produces malformed JSON or calls tools with vague queries like
`"search"` instead of `"FastAPI dependency injection"`, so the retrieval stage gets
garbage queries and P@5 collapses to 0.05.

**2. Forced single iteration (context window constraint).** API providers get 3 iterations
(call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration
adds ~2K tokens of tool results, and the 8K context window fills up. That leaves one shot at
picking the right tool and query, with no opportunity to refine.

**3. Weak instruction following (residual even when 1 and 2 don't bite).** Even when
Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation
format (`[source: filename.md]`) specified in the system prompt. Citation accuracy is 0.14:
the model is not hallucinating sources, it is mostly omitting them.
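
To make the prompt-based fallback in factor 1 concrete, the sketch below parses a tool call out of free-form model text and flags vague queries. It is an illustration under assumptions, not the benchmark's provider code: the JSON shape (`tool` and `query` keys), the tool name `search_docs`, and both function names are hypothetical.

```python
import json
import re


def parse_tool_call(text):
    """Extract a {"tool": ..., "query": ...} call from raw model text.

    Returns (tool, query), or None when no usable call is found.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)  # first "{" to last "}"
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed JSON: the failure mode described above
    if not isinstance(call, dict):
        return None
    tool, query = call.get("tool"), call.get("query")
    if not isinstance(tool, str) or not isinstance(query, str):
        return None
    return tool, query


def is_vague(query, tool_names, min_words=2):
    """Heuristic: flag queries that are too short or just repeat a tool name."""
    normalized = query.lower().strip()
    return len(normalized.split()) < min_words or normalized in tool_names


good = parse_tool_call(
    'Sure! {"tool": "search_docs", "query": "FastAPI dependency injection"}'
)
truncated = parse_tool_call('{"tool": "search_docs", "query": "FastAPI')
single_quoted = parse_tool_call("{'tool': 'search_docs', 'query': 'search'}")
```

Every `None` returned here costs the single available iteration, which is why malformed output is so expensive when `max_iterations=1`.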

**Keyword hit rate (0.61) is the interesting signal.** The model *sometimes* generates queries
with relevant keywords, meaning it has semantic understanding of the questions but can't
translate that into well-formed tool calls. This is exactly the gap between "understands
language" and "can operate as an agent."

**Cost is counterintuitively higher.** Self-hosted Mistral-7B costs $0.0031/query vs
$0.0004 for OpenAI gpt-4o-mini, despite being self-hosted. Modal A10G time is billed
per GPU-second, and Mistral-7B takes longer per query while producing worse results.
The cost advantage of self-hosted only materializes at high throughput with batched
requests and sustained GPU utilization, not at single-query evaluation scale.
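
The per-GPU-second billing argument can be made concrete with a little arithmetic. The sketch below takes the cost and latency figures from the results table and assumes, as simplifications not stated in the benchmark, that billed GPU time per query roughly equals the p50 latency and that concurrent requests share the GPU without slowing each other down. The function and variable names are illustrative.

```python
def cost_per_query(gpu_seconds, rate_per_gpu_second, concurrent_queries=1):
    """Cost of one query when GPU time is shared by concurrent_queries requests."""
    return gpu_seconds * rate_per_gpu_second / concurrent_queries


# Figures from the results table.
SELF_HOSTED_COST = 0.0031  # USD per query
SELF_HOSTED_P50_S = 6.709  # seconds
OPENAI_COST = 0.0004       # USD per query

# Assumption: billed GPU time per query is roughly the p50 latency.
implied_rate = SELF_HOSTED_COST / SELF_HOSTED_P50_S  # USD per GPU-second

# Smallest number of queries sharing the GPU at which the self-hosted
# cost per query drops to or below the OpenAI figure.
break_even = 1
while cost_per_query(SELF_HOSTED_P50_S, implied_rate, break_even) > OPENAI_COST:
    break_even += 1

print(f"implied rate: ${implied_rate:.6f}/GPU-second, break-even batch: {break_even}")
```

Under these idealized assumptions, roughly eight concurrent queries are needed to match gpt-4o-mini on cost alone (0.0031 / 0.0004 = 7.75), before accounting for the quality gap.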

## Infrastructure

| Config | Cold start | Warm latency p50 | GPU | Infra |
|--------|------------|------------------|-----|-------|
| OpenAI | N/A | 4,690 ms | N/A | Managed API |
| Anthropic | N/A | 5,120 ms | N/A | Managed API |
| Self-hosted (Modal) | ~90 s | 6,709 ms | A10G (24 GB) | Serverless GPU |

## How to Reproduce

```bash
# OpenAI evaluation
OPENAI_API_KEY=sk-... python scripts/evaluate.py --mode deterministic

# Anthropic evaluation
ANTHROPIC_API_KEY=sk-ant-... python scripts/evaluate.py --config configs/anthropic.yaml --mode deterministic

# Self-hosted evaluation (requires Modal deployment + HF secret)
pip install -e ".[modal]"
modal secret create huggingface-secret HF_TOKEN=hf_...
modal deploy modal/serve_vllm.py
export MODAL_VLLM_URL=https://your--agent-bench-vllm-serve.modal.run/v1
python scripts/evaluate.py --config configs/selfhosted_modal.yaml --mode deterministic

# All providers at once
make benchmark-all
```

## Known Limitations & Future Work

The self-hosted benchmark is not a controlled comparison. Two fixable constraints
disadvantage Mistral-7B beyond its inherent model quality, and a third limit is the
model's scale itself:

1. **Prompt-based tool calling (fixable).** vLLM 0.6.6 was pinned to work around
   `huggingface_hub` and `transformers` dependency conflicts. Newer vLLM versions (0.8+)
   support native Mistral tool calling via `--enable-auto-tool-choice --tool-call-parser mistral`.
   This would eliminate the malformed-JSON failure mode that drives P@5 to 0.05.
2. **Artificially low context window (fixable).** Mistral-7B supports 32K context natively,
   but `max_model_len` is set to 8K to fit A10G memory at `gpu_memory_utilization=0.85`.
   Bumping to 16K (with `0.90` utilization) would likely allow restoring `max_iterations=3`
   and `top_k=5` to match the API configs, making the comparison controlled.
3. **Model scale (architectural).** Even with fixes 1 and 2, a 7B model will underperform
   gpt-4o-mini and claude-haiku on multi-step agentic tasks. A fairer model-size comparison
   would use Mixtral-8x7B or Llama-3-70B (requiring A100 80GB). This would refine the
   model-size floor estimate but not change the architectural conclusion.
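
The context-window point in item 2 comes down to a token budget. The sketch below checks whether three iterations fit, using the ~2K tokens of tool results per iteration quoted earlier. The prompt overhead (1,500 tokens) and the answer reserve (1,024 tokens) are hypothetical placeholders, not measured values, and the function name is illustrative.

```python
def iterations_fit(context_window, iterations, tokens_per_iteration=2000,
                   prompt_tokens=1500, answer_reserve=1024):
    """Return True if the prompt, all tool results, and the answer fit.

    tokens_per_iteration follows the ~2K figure from the text;
    prompt_tokens and answer_reserve are assumed values.
    """
    needed = prompt_tokens + iterations * tokens_per_iteration + answer_reserve
    return needed <= context_window


fits_8k = iterations_fit(8192, iterations=3)    # 1500 + 6000 + 1024 = 8524 tokens
fits_16k = iterations_fit(16384, iterations=3)  # same 8524 tokens, larger window
```

With these placeholder numbers, three iterations overflow an 8K window but fit comfortably in 16K, which is consistent with the expectation that a larger `max_model_len` would allow `max_iterations=3`.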

## Takeaway

The provider abstraction works as designed: switching providers is a single config change.
API models dominate on quality metrics, but the self-hosted path demonstrates end-to-end
inference serving: vLLM on Modal (serverless A10G), OpenAI-compatible endpoint, identical
evaluation harness. The quality gap is a combination of model scale and infrastructure
constraints, both of which are documented and addressable.

---

Generated by `modal/run_benchmark.py`