---
title: Sage
emoji: π¦
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
---
# Sage

A recommendation system that refuses to hallucinate. Every claim is a verified quote from a real customer review. When evidence is sparse, it refuses rather than guesses.
Example response:

```json
{
  "query": "budget bluetooth headphones",
  "recommendations": [{
    "explanation": "Reviewers say \"For $18 Bluetooth headphones there is no better pair\" [review_141313]...",
    "confidence": {"hhem_score": 0.78, "is_grounded": true},
    "citations_verified": true
  }]
}
```
Try it: vxa8502-sage.hf.space
## Results
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| NDCG@10 | > 0.30 | 0.487 [0.37, 0.60] | 95% CI via bootstrap (n=42 queries) |
| Faithfulness (claim-level HHEM) | > 0.85 | 0.968 | 96.8% of individual quoted claims verified |
| Faithfulness (full-explanation HHEM) | - | 0.20 | 20% of full explanations pass; stricter but penalizes refusals |
| Faithfulness (RAGAS) | - | 0.50 | Penalizes citation-heavy style and graceful refusals |
| Human evaluation | > 3.5/5 | 3.6/5 | Single-rater; no inter-rater reliability |
| P99 latency | < 500ms | 283ms | Production load test |
**Faithfulness metrics:** The three measurements capture different aspects. Claim-level HHEM (96.8%) validates each quoted claim individually and is the primary metric, since Sage uses explicit citations. Full-explanation HHEM (20%) and RAGAS (0.50) score entire responses holistically and penalize graceful refusals as "failures." All three are reported for transparency.
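The difference between the two HHEM modes can be sketched as follows. Here `hhem_score` is a stub standing in for the real HHEM model (which returns the probability that a hypothesis is supported by the evidence), and the 0.5 grounding threshold is an assumption for illustration, not a documented setting.

```python
GROUNDED_THRESHOLD = 0.5  # assumed cutoff, not from the source


def hhem_score(evidence: str, claim: str) -> float:
    """Stub: the real HHEM model returns P(claim supported by evidence)."""
    return 0.9 if claim in evidence else 0.1


def claim_level_faithfulness(evidence: str, claims: list[str]) -> float:
    """Fraction of individual quoted claims judged grounded (primary metric)."""
    grounded = [hhem_score(evidence, c) >= GROUNDED_THRESHOLD for c in claims]
    return sum(grounded) / len(grounded)


def full_explanation_faithfulness(evidence: str, explanation: str) -> bool:
    """Single pass/fail over the whole explanation (the stricter mode)."""
    return hhem_score(evidence, explanation) >= GROUNDED_THRESHOLD


evidence = ("For $18 Bluetooth headphones there is no better pair. "
            "Battery lasts 10 hours.")
claims = ["For $18 Bluetooth headphones there is no better pair",
          "Battery lasts 10 hours"]

# Every individual claim is grounded, yet the concatenated explanation
# fails the holistic check, mirroring the 96.8% vs 20% gap above.
print(claim_level_faithfulness(evidence, claims))
print(full_explanation_faithfulness(evidence, " ".join(claims)))
```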
**Human evaluation limitation:** Single rater (the developer). Usefulness (3.06) and satisfaction (3.04) scores have high variance (std ≈ 1.7), suggesting inconsistent quality across query types.
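The NDCG confidence interval in the table is a percentile bootstrap over per-query scores, presumably computed along these lines (the actual evaluation lives in `scripts/evaluation.py`; the scores below are synthetic placeholders, not the real n=42 results):

```python
import random


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample queries with replacement; collect the resampled means.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Illustrative per-query NDCG@10 scores; the report uses 42 real queries.
rng = random.Random(1)
scores = [rng.random() for _ in range(42)]
lo, hi = bootstrap_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```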
## Architecture
```
User Query: "wireless earbuds for running"
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│                  SAGE API (FastAPI)                   │
├───────────────────────────────────────────────────────┤
│ 1. EMBED       │ E5-small (384-dim)           ~20ms   │
│ 2. CACHE CHECK │ Exact + semantic (0.92 sim)  ~1ms    │
│ 3. RETRIEVE    │ Qdrant vector search         ~50ms   │
│ 4. AGGREGATE   │ Chunk → Product (MAX score)  ~1ms    │
│ 5. EXPLAIN     │ Claude/GPT + evidence        ~300ms  │
│ 6. VERIFY      │ HHEM hallucination check     ~50ms   │
└───────────────────────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Response:                                             │
│  - Product ID + score                                 │
│  - Explanation with [citations]                       │
│  - HHEM confidence score                              │
│  - Quote verification results                         │
└───────────────────────────────────────────────────────┘
```
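Step 4 (chunk-to-product aggregation) can be sketched as follows. The hit field names (`product_id`, `score`) are illustrative, not Sage's actual payload schema:

```python
from collections import defaultdict


def aggregate_chunks_to_products(hits):
    """Collapse chunk-level retrieval hits to one score per product by
    taking the MAX chunk score, then rank products by that score."""
    best = defaultdict(float)
    for hit in hits:  # hits: [{"product_id": ..., "score": ...}, ...]
        pid = hit["product_id"]
        best[pid] = max(best[pid], hit["score"])
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


hits = [
    {"product_id": "B01", "score": 0.82},
    {"product_id": "B02", "score": 0.90},
    {"product_id": "B01", "score": 0.74},
]
print(aggregate_chunks_to_products(hits))  # → [('B02', 0.9), ('B01', 0.82)]
```

MAX (rather than mean) aggregation rewards a product whose single best-matching review chunk is highly relevant, which fits a citation-driven explainer: the top chunk is exactly the evidence that will be quoted.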
**Data flow:** Amazon Electronics reviews (1M raw) → 5-core filter → 334K reviews → semantic chunking → 423K chunks in Qdrant. (`pipeline.py` | Kaggle notebook)
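The 5-core filter is the standard k-core preprocessing step: iteratively drop reviews until every remaining user and product has at least k reviews (dropping one side can push the other below the threshold, hence the loop). A minimal sketch with an illustrative record layout of `(user, product)` pairs:

```python
def k_core_filter(reviews, k=5):
    """Keep only reviews whose user AND product each appear >= k times,
    repeating until the set stabilizes."""
    while True:
        users, items = {}, {}
        for u, i in reviews:
            users[u] = users.get(u, 0) + 1
            items[i] = items.get(i, 0) + 1
        kept = [(u, i) for u, i in reviews
                if users[u] >= k and items[i] >= k]
        if len(kept) == len(reviews):
            return kept
        reviews = kept


# Tiny example with k=2: product "y" and user "c" fall below the
# threshold, and dropping them then pulls user "a" below it too.
reviews = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "x"), ("c", "x")]
print(k_core_filter(reviews, k=2))  # → [('b', 'x'), ('b', 'x')]
```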
## Limitations
| Constraint | Behavior |
|---|---|
| Insufficient evidence (< 2 chunks) | Refuses to explain |
| Low relevance (top score < 0.7) | Refuses to explain |
| Single category (Electronics) | Architecture supports multi-category; data constraint only |
| No image features | Text-only retrieval |
| English only | E5 primarily English-trained |
| Cold start | First request ~10s (HF wake), then P99 < 500ms |
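The first two refusal rules in the table reduce to a simple guard before the EXPLAIN step. A sketch with illustrative names (the thresholds are the ones documented above; the function itself is not Sage's actual code):

```python
MIN_CHUNKS = 2       # refuse when fewer evidence chunks than this
MIN_TOP_SCORE = 0.7  # refuse when the best chunk scores below this


def should_refuse(chunk_scores):
    """Return (refuse?, reason) given retrieval scores for one product."""
    if len(chunk_scores) < MIN_CHUNKS:
        return True, "insufficient evidence"
    if max(chunk_scores) < MIN_TOP_SCORE:
        return True, "low relevance"
    return False, ""


print(should_refuse([0.81]))        # → (True, 'insufficient evidence')
print(should_refuse([0.65, 0.60]))  # → (True, 'low relevance')
print(should_refuse([0.81, 0.72]))  # → (False, '')
```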
## Quick Start
```sh
git clone https://github.com/vxa8502/sage-recommendations && cd sage-recommendations
cp .env.example .env  # then add ANTHROPIC_API_KEY or OPENAI_API_KEY
```

**Docker:**

```sh
docker compose up
```

**Local:**

```sh
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pipeline,api,anthropic]"
make qdrant-up && make data && make serve
```
## API Reference
### POST /recommend

```sh
curl -X POST https://vxa8502-sage.hf.space/recommend \
  -H "Content-Type: application/json" \
  -d '{"query": "wireless earbuds for running", "k": 3, "explain": true}'
```
Returns products with grounded explanations, HHEM confidence scores, and verified citations.
### POST /recommend/stream
Server-sent events for token-by-token explanation streaming.
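On the consumer side, server-sent events arrive as newline-delimited `data:` lines. A minimal parsing sketch; the payload shape, including the `[DONE]` sentinel, is an assumption for illustration, not the endpoint's documented schema:

```python
def parse_sse(stream_lines):
    """Yield the payload of each SSE `data:` line, skipping blank
    keep-alive lines between events."""
    for line in stream_lines:
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


# Simulated stream of explanation tokens.
sample = ["data: Reviewers", "data: say", "", "data: [DONE]"]
print(list(parse_sse(sample)))  # → ['Reviewers', 'say', '[DONE]']
```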
### GET /health, /metrics, /cache/stats
Health check, Prometheus metrics, and cache statistics.
## Evaluation

```sh
make eval       # ~5 min: standard pre-commit
make eval-full  # ~17 min: complete suite + load test
```
## Project Structure (Key Directories)
```
sage/
├── adapters/        # External integrations (Qdrant, LLM, HHEM)
├── api/             # FastAPI routes, middleware, Prometheus metrics
├── config/          # Settings, logging, query templates
├── core/            # Domain models, aggregation, verification, chunking
└── services/        # Business logic (retrieval, explanation, cache)
scripts/
├── pipeline.py      # Data ingestion and embedding
├── evaluation.py    # NDCG, precision, recall, novelty, baselines
├── faithfulness.py  # HHEM, RAGAS, grounding delta
├── human_eval.py    # Interactive human evaluation
└── load_test.py     # P99 latency benchmarking
```
## License
Academic/portfolio use only. Uses Amazon Reviews 2023 dataset.