Spaces:
Running
Running
docs: deepen self-hosted analysis in provider comparison
Browse files- Restructure as three causal factors ordered by priority: no native
tool calling -> context window constraint -> weak instruction following
- Add keyword hit rate (0.61) as signal of language understanding vs
agent capability gap
- Explain counterintuitive cost finding: self-hosted is more expensive
per-query at low volume due to GPU-second billing model
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/provider_comparison.md +32 -24
docs/provider_comparison.md
CHANGED
|
@@ -19,30 +19,38 @@ comparable to each other.
|
|
| 19 |
| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
|
| 20 |
| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
|
| 21 |
|
| 22 |
-
##
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 46 |
|
| 47 |
## Infrastructure
|
| 48 |
|
|
|
|
| 19 |
| Anthropic (API) | claude-haiku-4-5 | 3 | 5 | 0.74 | 0.84 | 1.00 | 5,120 | $0.0007 |
|
| 20 |
| Self-hosted (Modal) | Mistral-7B-Instruct-v0.3 | 1 | 3 | 0.05 | 0.05 | 0.14 | 6,709 | $0.0031 |
|
| 21 |
|
| 22 |
+
## Why Mistral-7B Scores So Differently
|
| 23 |
+
|
| 24 |
+
The gap between API providers and self-hosted Mistral-7B compounds from three factors,
|
| 25 |
+
ordered by causal priority:
|
| 26 |
+
|
| 27 |
+
**1. No native tool calling (upstream of everything else).** vLLM 0.6.6 with Mistral-7B
|
| 28 |
+
doesn't support OpenAI-format `tool_calls`. The provider falls back to injecting tool
|
| 29 |
+
descriptions into the system prompt and parsing JSON from the model's text output.
|
| 30 |
+
Mistral-7B frequently produces malformed JSON or calls tools with vague queries like
|
| 31 |
+
`"search"` instead of `"FastAPI dependency injection"` β so the retrieval stage gets
|
| 32 |
+
garbage queries, and P@5 collapses to 0.05.
|
| 33 |
+
|
| 34 |
+
**2. Forced single iteration (context window constraint).** API providers get 3 iterations
|
| 35 |
+
(call tool, read result, refine, repeat). Mistral-7B is limited to 1 because each iteration
|
| 36 |
+
adds ~2K tokens of tool results, and the 8K context window fills up. One shot at picking the
|
| 37 |
+
right tool and query, no opportunity to refine.
|
| 38 |
+
|
| 39 |
+
**3. Weak instruction following (residual even when 1 and 2 don't bite).** Even when
|
| 40 |
+
Mistral-7B calls the right tool with a reasonable query, it struggles to follow the citation
|
| 41 |
+
format (`[source: filename.md]`) specified in the system prompt. Citation accuracy is 0.14 β
|
| 42 |
+
it's not hallucinating sources, it's mostly omitting them.
|
| 43 |
+
|
| 44 |
+
**Keyword hit rate (0.61) is the interesting signal.** The model *sometimes* generates queries
|
| 45 |
+
with relevant keywords, meaning it has semantic understanding of the questions but can't
|
| 46 |
+
translate that into well-formed tool calls. This is exactly the gap between "understands
|
| 47 |
+
language" and "can operate as an agent."
|
| 48 |
+
|
| 49 |
+
**Cost is counterintuitively higher.** Self-hosted Mistral-7B costs $0.0031/query vs
|
| 50 |
+
$0.0004 for OpenAI gpt-4o-mini β despite being self-hosted. Modal A10G time is billed
|
| 51 |
+
per GPU-second, and Mistral-7B takes longer per query while producing worse results.
|
| 52 |
+
The cost advantage of self-hosted only materializes at high throughput with batched
|
| 53 |
+
requests and sustained GPU utilization, not at single-query evaluation scale.
|
| 54 |
|
| 55 |
## Infrastructure
|
| 56 |
|