vxa8502 committed on
Commit
1f0ea41
·
1 Parent(s): d3d443f

Align all docs with eval_results JSON source of truth

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -34,7 +34,7 @@ A recommendation system that refuses to hallucinate.
 
 Product recommendations without explanations are black boxes. Users see "You might like X" but never learn *why*. When you ask an LLM to explain, it confidently invents features and fabricates reviews.
 
-**Sage is different:** Every claim is a verified quote from real customer reviews. When evidence is sparse, it refuses rather than guesses. Users rated trust at **4.6/5** because honesty beats confident fabrication.
+**Sage is different:** Every claim is a verified quote from real customer reviews. When evidence is sparse, it refuses rather than guesses. Human evaluation scored trust at **4.3/5** because honesty beats confident fabrication.
 
 ---
 
@@ -43,12 +43,12 @@ Product recommendations without explanations are black boxes. Users see "You mig
 | Metric | Target | Achieved | Status |
 |--------|--------|----------|--------|
 | NDCG@10 (recommendation quality) | > 0.30 | 0.295 | 98% |
-| Claim-level faithfulness (HHEM) | > 0.85 | 0.952 | Pass |
-| Human evaluation (n=50) | > 3.5/5 | 4.43/5 | Pass |
-| P99 latency (retrieval) | < 500ms | 283ms | Pass |
-| P99 latency (cache hit) | < 100ms | ~80ms | Pass |
+| Claim-level faithfulness (HHEM) | > 0.85 | 0.967 | Pass |
+| Human evaluation (n=50) | > 3.5/5 | 3.85/5 | Pass |
+| P99 latency (production) | < 500ms | 200ms | Pass |
+| P99 latency (cache hit) | < 100ms | 86ms | Pass |
 
-**Grounding impact:** Explanations generated WITH evidence score 69% on HHEM. WITHOUT evidence: 3%. RAG grounding reduces hallucination by 66 percentage points.
+**Grounding impact:** Explanations generated WITH evidence score 71% on HHEM. WITHOUT evidence: 2.5%. RAG grounding reduces hallucination by 68 percentage points.
 
 ---
 
@@ -79,7 +79,7 @@ User Query: "wireless earbuds for running"
 └──────────────────────────────────────────────────────────────┘
 ```
 
-**Data flow:** 1M Amazon reviews → 5-core filter → 30K reviews → semantic chunking → 423K chunks in Qdrant.
+**Data flow:** 1M Amazon reviews → 5-core filter → 334K reviews → semantic chunking → 423K chunks in Qdrant.
 
 ---
 
@@ -91,12 +91,12 @@ When you give an LLM one short review as context, it fills in the gaps with plau
 
 | Decision | Alternative | Why This Choice |
 |----------|-------------|-----------------|
-| **E5-small** (384-dim) | E5-large, BGE-large | 3x faster, same accuracy on product reviews. Latency > marginal gains. |
+| **E5-small** (384-dim) | E5-large, BGE-large | Faster inference, same accuracy on product reviews. Latency > marginal gains. |
 | **Qdrant** | Pinecone, Weaviate | Free cloud tier, payload filtering, clean Python SDK. |
-| **Semantic chunking** | Fixed-window | Preserves complete arguments; +12% quote verification rate. |
-| **HHEM** (Vectara) | GPT-4 judge, NLI models | Purpose-built for RAG hallucination; no API cost; 0.97 AUC. |
+| **Semantic chunking** | Fixed-window | Preserves complete arguments; better quote verification. |
+| **HHEM** (Vectara) | GPT-4 judge, NLI models | Purpose-built for RAG hallucination; no API cost. |
 | **Claim-level evaluation** | Full-explanation | Isolates which claims hallucinate; more actionable. |
-| **Quality gate** (refuse) | Always answer | 46% refusal rate → 4.6/5 trust. Honesty > coverage. |
+| **Quality gate** (refuse) | Always answer | 48% refusal rate → 4.3/5 trust. Honesty > coverage. |
 
 ---
 
@@ -108,7 +108,7 @@ When you give an LLM one short review as context, it fills in the gaps with plau
 | **No image features** | Misses visual product attributes | Could add CLIP embeddings in future |
 | **English only** | Non-English reviews have lower retrieval quality | E5 is primarily English-trained |
 | **Cache invalidation manual** | Stale explanations possible | TTL-based expiry (1 hour); manual `/cache/clear` |
-| **LLM latency on free tier** | P99 ~4s with explanations | Retrieval alone is 283ms; cache hits are ~80ms |
+| **LLM latency on free tier** | P99 ~4s with explanations | Retrieval alone is 200ms; cache hits are 86ms |
 | **No user personalization** | Same results for all users | Would need user history for collaborative filtering |
 
 ---
@@ -230,7 +230,7 @@ scripts/
 | Condition | System Behavior |
 |-----------|-----------------|
 | Insufficient evidence (< 2 chunks) | Refuses to explain |
-| Low relevance (top score < 0.5) | Refuses to explain |
+| Quote not found in evidence | Falls back to paraphrased claims |
+| Low relevance (top score < 0.7) | Refuses to explain |
 | Quote not found in evidence | Falls back to paraphrased claims |
 | HHEM score < 0.5 | Flags as uncertain |
 
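
The fallback conditions in the failure-modes table can be sketched as a simple gate. This is a hypothetical illustration, not the repository's actual code: the function name `quality_gate` and its signature are invented, and only the thresholds (fewer than 2 evidence chunks, top relevance score below 0.7, HHEM score below 0.5) come from the updated table above.

```python
# Hypothetical sketch of the quality gate described in the failure-modes table.
# Thresholds mirror the documented behavior; names are illustrative only.
MIN_CHUNKS = 2       # insufficient evidence below this
MIN_TOP_SCORE = 0.7  # low relevance below this
MIN_HHEM = 0.5       # flag explanation as uncertain below this

def quality_gate(chunks, top_score, hhem_score=None):
    """Return 'refuse', 'uncertain', or 'explain' per the documented conditions."""
    if len(chunks) < MIN_CHUNKS:
        return "refuse"      # insufficient evidence (< 2 chunks)
    if top_score < MIN_TOP_SCORE:
        return "refuse"      # low relevance (top score < 0.7)
    if hhem_score is not None and hhem_score < MIN_HHEM:
        return "uncertain"   # HHEM score < 0.5 -> flag as uncertain
    return "explain"

print(quality_gate(["chunk-1", "chunk-2"], 0.82, hhem_score=0.91))  # explain
```

Checking refusal before faithfulness scoring matches the table's ordering: evidence quantity and relevance gate whether an explanation is attempted at all, while the HHEM check only downgrades an explanation that was already generated.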