Nomearod Claude Opus 4.6 (1M context) commited on
Commit
04cb97f
Β·
1 Parent(s): f0224d3

docs: deepen self-hosted analysis in provider comparison

Browse files

- Restructure as three causal factors ordered by priority: no native
tool calling -> context window constraint -> weak instruction following
- Add keyword hit rate (0.61) as signal of language understanding vs
agent capability gap
- Explain counterintuitive cost finding: self-hosted is more expensive
per-query at low volume due to GPU-second billing model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (1) hide show
  1. docs/provider_comparison.md +32 -24
docs/provider_comparison.md CHANGED
@@ -19,30 +19,38 @@ comparable to each other.
19
  | Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
20
  | Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
21
 
22
- ## Analysis
23
-
24
- **Retrieval quality:** API models (gpt-4o-mini, claude-haiku) generate substantially better
25
- search queries than Mistral-7B, reflected in P@5 (0.70-0.74 vs 0.05). The 7B model struggles
26
- with prompt-based tool calling β€” it often produces malformed JSON or calls tools with
27
- poor queries, degrading retrieval quality.
28
-
29
- **Citation accuracy:** Both API providers achieve 1.00 citation accuracy (zero hallucinated
30
- citations). Mistral-7B manages 0.14, frequently omitting or fabricating source references.
31
- This is a known limitation of smaller models on instruction-following tasks.
32
-
33
- **Latency:** Self-hosted latency (6,709ms p50) is higher than API providers due to the
34
- proxy overhead and smaller model generating more tokens before reaching a final answer.
35
- Cold start adds ~90s on first request (model download + GPU load).
36
-
37
- **Cost:** Self-hosted cost ($0.0031/query) is computed from GPU-seconds
38
- (latency x Modal A10G rate of $0.000361/sec). This is higher per-query than API providers
39
- at low volume, but the cost model is fundamentally different β€” GPU cost scales with
40
- compute time, not token count.
41
-
42
- **Tool calling:** Mistral-7B does not support native OpenAI-format tool calling in vLLM
43
- 0.6.6. The provider falls back to prompt-based tool selection (injecting tool descriptions
44
- into the system prompt and parsing JSON from the model's text output). This works but is
45
- unreliable β€” a legitimate benchmark finding, not a failure.
 
 
 
 
 
 
 
 
46
 
47
  ## Infrastructure
48
 
 
19
  | Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
20
  | Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
21
 
22
+ ## Why Mistral-7B Scores So Differently
23
+
24
+ The gap between API providers and self-hosted Mistral-7B compounds from three factors,
25
+ ordered by causal priority:
26
+
27
+ **1. No native tool calling (upstream of everything else).** vLLM 0.6.6 with Mistral-7B
28
+ doesn't support OpenAI-format `tool_calls`. The provider falls back to injecting tool
29
+ descriptions into the system prompt and parsing JSON from the model's text output.
30
+ Mistral-7B frequently produces malformed JSON or calls tools with vague queries like
31
+ `"search"` instead of `"FastAPI dependency injection"` β€” so the retrieval stage gets
32
+ garbage queries, and P@5 collapses to 0.05.
33
+
34
+ **2. Forced single iteration (context window constraint).** API providers get 3 iterations
35
+ (call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration
36
+ adds ~2K tokens of tool results, and the 8K context window fills up. One shot at picking the
37
+ right tool and query, no opportunity to refine.
38
+
39
+ **3. Weak instruction following (residual even when 1 and 2 don't bite).** Even when
40
+ Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation
41
+ format (`[source: filename.md]`) specified in the system prompt. Citation accuracy is 0.14 β€”
42
+ it's not hallucinating sources, it's mostly omitting them.
43
+
44
+ **Keyword hit rate (0.61) is the interesting signal.** The model *sometimes* generates queries
45
+ with relevant keywords, meaning it has semantic understanding of the questions but can't
46
+ translate that into well-formed tool calls. This is exactly the gap between "understands
47
+ language" and "can operate as an agent."
48
+
49
+ **Cost is counterintuitively higher.** Self-hosted Mistral-7B costs $0.0031/query vs
50
+ $0.0004 for OpenAI gpt-4o-mini β€” despite being self-hosted. Modal A10G time is billed
51
+ per GPU-second, and Mistral-7B takes longer per query while producing worse results.
52
+ The cost advantage of self-hosted only materializes at high throughput with batched
53
+ requests and sustained GPU utilization, not at single-query evaluation scale.
54
 
55
  ## Infrastructure
56