Align all docs with eval_results JSON source of truth

README.md

A recommendation system that refuses to hallucinate.

Product recommendations without explanations are black boxes. Users see "You might like X" but never learn *why*. When you ask an LLM to explain, it confidently invents features and fabricates reviews.

**Sage is different:** Every claim is a verified quote from real customer reviews. When evidence is sparse, it refuses rather than guesses. Human evaluation scored trust at **4.3/5** because honesty beats confident fabrication.

---

| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| NDCG@10 (recommendation quality) | > 0.30 | 0.295 | 98% of target |
| Claim-level faithfulness (HHEM) | > 0.85 | 0.967 | Pass |
| Human evaluation (n=50) | > 3.5/5 | 3.85/5 | Pass |
| P99 latency (production) | < 500ms | 200ms | Pass |
| P99 latency (cache hit) | < 100ms | 86ms | Pass |

**Grounding impact:** Explanations generated WITH evidence score 71% on HHEM. WITHOUT evidence: 2.5%. RAG grounding reduces hallucination by roughly 68 percentage points.
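
For intuition, that comparison can be reproduced with Vectara's open HHEM checkpoint on Hugging Face. A minimal sketch, assuming the `vectara/hallucination_evaluation_model` checkpoint and its model-card `predict` API; the evidence and claims below are invented examples, not our eval data:

```python
# Minimal sketch: score a grounded vs. an ungrounded claim with HHEM.
# Pairs are (evidence, claim); predict() returns a consistency score in [0, 1].
from transformers import AutoModelForSequenceClassification

hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

evidence = "Battery easily lasted my two-hour run and they never fell out."
pairs = [
    (evidence, "Reviewers say the earbuds stay put while running."),  # grounded
    (evidence, "The earbuds are waterproof to 50 meters."),           # invented
]
print(hhem.predict(pairs))  # expect a high score, then a low one
```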
---

```
User Query: "wireless earbuds for running"
───────────────────────────────────────────────────────────────
```

**Data flow:** 1M Amazon reviews → 5-core filter → 334K reviews → semantic chunking → 423K chunks in Qdrant.
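
A minimal sketch of that flow, assuming the `intfloat/e5-small-v2` checkpoint (384-dim) and a local Qdrant instance; the collection name, payload fields, and sample chunk are illustrative, not this repo's actual schema:

```python
# Minimal sketch: embed review chunks with E5-small and index them in Qdrant.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/e5-small-v2")  # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="review_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

chunks = [{"product_id": "B0EXAMPLE", "text": "Battery easily lasted my two-hour run."}]

# E5 expects a "passage: " prefix on documents ("query: " on queries).
vectors = encoder.encode([f"passage: {c['text']}" for c in chunks])

client.upsert(
    collection_name="review_chunks",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload=chunk)
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)
```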
---

When you give an LLM one short review as context, it fills in the gaps with plausible fabrications.

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| **E5-small** (384-dim) | E5-large, BGE-large | Faster inference, same accuracy on product reviews. Latency > marginal gains. |
| **Qdrant** | Pinecone, Weaviate | Free cloud tier, payload filtering, clean Python SDK. |
| **Semantic chunking** | Fixed-window | Preserves complete arguments; better quote verification. |
| **HHEM** (Vectara) | GPT-4 judge, NLI models | Purpose-built for RAG hallucination; no API cost. |
| **Claim-level evaluation** | Full-explanation | Isolates which claims hallucinate; more actionable. |
| **Quality gate** (refuse) | Always answer | 48% refusal rate → 4.3/5 trust. Honesty > coverage. |
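
To make the E5 and Qdrant rows concrete, here is a retrieval sketch showing the E5 `query:` prefix and a Qdrant payload filter; it reuses the illustrative names from the indexing sketch above:

```python
# Minimal sketch: retrieve evidence for one product with a payload filter.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/e5-small-v2")
client = QdrantClient(url="http://localhost:6333")

query_vec = encoder.encode("query: wireless earbuds for running")  # E5 query prefix

hits = client.search(
    collection_name="review_chunks",
    query_vector=query_vec.tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="product_id", match=MatchValue(value="B0EXAMPLE"))]
    ),
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])
```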
---

| Limitation | Impact | Mitigation |
|------------|--------|------------|
| **No image features** | Misses visual product attributes | Could add CLIP embeddings in future |
| **English only** | Non-English reviews have lower retrieval quality | E5 is primarily English-trained |
| **Cache invalidation manual** | Stale explanations possible | TTL-based expiry (1 hour); manual `/cache/clear` |
| **LLM latency on free tier** | P99 ~4s with explanations | Retrieval alone is 200ms; cache hits are 86ms |
| **No user personalization** | Same results for all users | Would need user history for collaborative filtering |
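
The cache row above translates to roughly this behavior. A minimal in-process sketch, assuming FastAPI for the `/cache/clear` route; the real service's storage backend and key scheme may differ:

```python
# Minimal sketch: TTL-based cache expiry (1 hour) plus a manual clear endpoint.
import time

from fastapi import FastAPI

app = FastAPI()
TTL_SECONDS = 3600  # 1-hour expiry, per the table above
_cache: dict[str, tuple[float, dict]] = {}  # key -> (stored_at, explanation)


def cache_get(key: str) -> dict | None:
    entry = _cache.get(key)
    if entry is None:
        return None
    stored_at, value = entry
    if time.time() - stored_at > TTL_SECONDS:  # stale: evict and report a miss
        del _cache[key]
        return None
    return value


def cache_put(key: str, value: dict) -> None:
    _cache[key] = (time.time(), value)


@app.post("/cache/clear")
def clear_cache() -> dict:
    _cache.clear()
    return {"cleared": True}
```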
---

| Condition | System Behavior |
|-----------|-----------------|
| Insufficient evidence (< 2 chunks) | Refuses to explain |
| Low relevance (top score < 0.7) | Refuses to explain |
| Quote not found in evidence | Falls back to paraphrased claims |
| HHEM score < 0.5 | Flags as uncertain |
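
Put together, the gate looks roughly like the sketch below; the thresholds are the ones in this table, while the data shapes and paraphrase fallback are invented for illustration:

```python
# Minimal sketch: the failure-mode table as code. Thresholds match the table;
# dict shapes are illustrative. A None return is the refusal path.
def quality_gate(chunks: list[dict], draft: dict) -> dict | None:
    """chunks: [{"text": str, "score": float}]; draft: {"quotes": [...], "hhem_score": float}."""
    if len(chunks) < 2:  # insufficient evidence -> refuse to explain
        return None
    if max(c["score"] for c in chunks) < 0.7:  # low relevance -> refuse
        return None

    evidence = " ".join(c["text"] for c in chunks)
    if not all(quote in evidence for quote in draft["quotes"]):
        # Quote not found verbatim in evidence -> fall back to paraphrased claims.
        draft = {**draft, "quotes": [], "paraphrased": True}
    if draft["hhem_score"] < 0.5:  # weak HHEM support -> flag as uncertain
        draft = {**draft, "uncertain": True}
    return draft
```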