---
title: Sage
emoji: πŸ¦‰
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 7860
---

# Sage

*CI Β· Python 3.11+*

A recommendation system that refuses to hallucinate. Every claim is a verified quote from real customer reviews. When evidence is sparse, it refuses rather than guessing.

```json
{
  "query": "budget bluetooth headphones",
  "recommendations": [{
    "explanation": "Reviewers say \"For $18 Bluetooth headphones there is no better pair\" [review_141313]...",
    "confidence": {"hhem_score": 0.78, "is_grounded": true},
    "citations_verified": true
  }]
}
```
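The `citations_verified` flag reflects a quote check along these lines: every quoted span must appear verbatim in the review it cites. This is a minimal sketch of the idea; the function name and signature are illustrative, not Sage's actual API.

```python
import re

def verify_citations(explanation: str, reviews: dict[str, str]) -> bool:
    """Check that every quoted claim of the form "..." [review_NNN]
    appears verbatim in the cited review's text."""
    claims = re.findall(r'"([^"]+)"\s*\[(review_\d+)\]', explanation)
    if not claims:
        return False  # nothing checkable -> not verified
    return all(quote in reviews.get(review_id, "") for quote, review_id in claims)

explanation = 'Reviewers say "For $18 Bluetooth headphones there is no better pair" [review_141313].'
reviews = {"review_141313": "For $18 Bluetooth headphones there is no better pair. Sound is great."}
print(verify_citations(explanation, reviews))  # True
```

Exact substring matching is deliberately strict: a paraphrased "quote" fails the check, which is the point.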

**Try it:** [vxa8502-sage.hf.space](https://vxa8502-sage.hf.space)


## Results

| Metric | Target | Achieved | Notes |
|--------|--------|----------|-------|
| NDCG@10 | > 0.30 | 0.487 [0.37, 0.60] | 95% CI via bootstrap (n=42 queries) |
| Faithfulness (claim-level HHEM) | > 0.85 | 0.968 | 96.8% of individual quoted claims verified |
| Faithfulness (full-explanation HHEM) | - | 0.20 | 20% of full explanations pass; stricter, but penalizes refusals |
| Faithfulness (RAGAS) | - | 0.50 | Penalizes citation-heavy style and graceful refusals |
| Human evaluation | > 3.5/5 | 3.6/5 | Single rater; no inter-rater reliability |
| P99 latency | < 500ms | 283ms | Production load test |

**Faithfulness metrics:** The three measurements capture different aspects of grounding. Claim-level HHEM (96.8%) validates each quoted claim individually; it is the primary metric, since Sage cites evidence explicitly. Full-explanation HHEM (20%) and RAGAS (0.50) score entire responses holistically, which penalizes graceful refusals as "failures." All three are reported for transparency.

**Human evaluation limitation:** Single rater (the developer). Usefulness (3.06) and satisfaction (3.04) show high variance (std ~1.7), suggesting inconsistent quality across query types.
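The NDCG@10 confidence interval above comes from bootstrap resampling over queries. The idea can be sketched as follows (an illustrative percentile bootstrap with toy scores, not the project's `evaluation.py`):

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of per-query metric scores:
    resample queries with replacement, take the mean each time, then
    read off the (alpha/2, 1 - alpha/2) percentiles."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = sorted(
        sum(rng.choices(per_query_scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

scores = [0.2, 0.5, 0.9, 0.4, 0.6, 0.3, 0.7, 0.5]  # toy per-query NDCG@10 values
print(bootstrap_ci(scores))
```

With only 42 queries the interval is wide, which is why the table reports [0.37, 0.60] rather than a point estimate alone.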


## Architecture

```text
User Query: "wireless earbuds for running"
                    β”‚
                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      SAGE API (FastAPI)                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚
β”‚  1. EMBED         β”‚  E5-small (384-dim)           ~20ms    β”‚
β”‚  2. CACHE CHECK   β”‚  Exact + semantic (0.92 sim)  ~1ms     β”‚
β”‚  3. RETRIEVE      β”‚  Qdrant vector search         ~50ms    β”‚
β”‚  4. AGGREGATE     β”‚  Chunk β†’ Product (MAX score)  ~1ms     β”‚
β”‚  5. EXPLAIN       β”‚  Claude/GPT + evidence        ~300ms   β”‚
β”‚  6. VERIFY        β”‚  HHEM hallucination check     ~50ms    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Response:                                                  β”‚
β”‚  - Product ID + score                                       β”‚
β”‚  - Explanation with [citations]                             β”‚
β”‚  - HHEM confidence score                                    β”‚
β”‚  - Quote verification results                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
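Step 4 above collapses chunk-level search hits into product-level scores by keeping each product's best chunk. A minimal sketch of MAX aggregation (names are illustrative, not Sage's internal API):

```python
from collections import defaultdict

def aggregate_max(chunk_hits):
    """chunk_hits: list of (product_id, score) pairs from vector search.
    Returns products ranked by their best-scoring chunk (MAX aggregation)."""
    best = defaultdict(float)
    for product_id, score in chunk_hits:
        best[product_id] = max(best[product_id], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

hits = [("p1", 0.81), ("p2", 0.77), ("p1", 0.92), ("p3", 0.64)]
print(aggregate_max(hits))  # [('p1', 0.92), ('p2', 0.77), ('p3', 0.64)]
```

MAX (rather than mean) rewards a single strongly relevant review chunk, so a product is not diluted by its many off-topic reviews.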

**Data flow:** Amazon Electronics reviews (1M raw) β†’ 5-core filter β†’ 334K reviews β†’ semantic chunking β†’ 423K chunks in Qdrant. (pipeline.py | Kaggle notebook)
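The 5-core filter is the standard k-core trick: keep only users and products with at least five reviews, reapplied until the set is stable, since dropping one side can push the other below the threshold. A sketch of the technique (Sage's `pipeline.py` may implement it differently):

```python
from collections import Counter

def k_core(interactions, k=5):
    """interactions: list of (user_id, product_id) pairs. Iteratively drop
    users and products with fewer than k interactions until stable."""
    while True:
        users = Counter(u for u, _ in interactions)
        items = Counter(p for _, p in interactions)
        kept = [(u, p) for u, p in interactions if users[u] >= k and items[p] >= k]
        if len(kept) == len(interactions):
            return kept
        interactions = kept

pairs = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"), ("u3", "p3")]
print(k_core(pairs, k=2))  # drops the lone ("u3", "p3") interaction
```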


## Limitations

| Constraint | Behavior |
|------------|----------|
| Insufficient evidence (< 2 chunks) | Refuses to explain |
| Low relevance (top score < 0.7) | Refuses to explain |
| Single category (Electronics) | Architecture supports multi-category; data constraint only |
| No image features | Text-only retrieval |
| English only | E5 primarily English-trained |
| Cold start | First request ~10s (HF wake), then P99 < 500ms |
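The first two rows amount to an evidence gate that runs before any explanation is generated. Sketched with the thresholds from the table (the function name is illustrative):

```python
MIN_CHUNKS = 2       # refuse with fewer than 2 supporting chunks
MIN_TOP_SCORE = 0.7  # refuse when the best chunk is weakly relevant

def should_explain(chunk_scores: list[float]) -> bool:
    """Gate explanation generation on evidence quantity and quality."""
    return (
        len(chunk_scores) >= MIN_CHUNKS
        and max(chunk_scores, default=0.0) >= MIN_TOP_SCORE
    )

print(should_explain([0.82, 0.74]))  # True: enough evidence
print(should_explain([0.91]))        # False: only one chunk
print(should_explain([0.65, 0.61]))  # False: top score below 0.7
```

Refusing here is what keeps the claim-level faithfulness number high: no explanation is ever generated from thin evidence.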

## Quick Start

```bash
git clone https://github.com/vxa8502/sage-recommendations && cd sage-recommendations
cp .env.example .env   # Then add ANTHROPIC_API_KEY or OPENAI_API_KEY
```

**Docker:**

```bash
docker compose up
```

**Local:**

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pipeline,api,anthropic]"
make qdrant-up && make data && make serve
```

## API Reference

### POST /recommend

```bash
curl -X POST https://vxa8502-sage.hf.space/recommend \
  -H "Content-Type: application/json" \
  -d '{"query": "wireless earbuds for running", "k": 3, "explain": true}'
```

Returns products with grounded explanations, HHEM confidence scores, and verified citations.
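Client-side, it can pay to keep only grounded results. A sketch that filters a decoded response dict, using the field names from the sample response at the top of this README (the threshold and function name are the caller's choice, not part of the API):

```python
def grounded_only(response: dict, min_hhem: float = 0.5) -> list[dict]:
    """Keep recommendations whose citations were verified, whose
    explanation is grounded, and whose HHEM score clears a threshold."""
    return [
        rec for rec in response.get("recommendations", [])
        if rec.get("citations_verified")
        and rec.get("confidence", {}).get("is_grounded")
        and rec.get("confidence", {}).get("hhem_score", 0.0) >= min_hhem
    ]

response = {
    "query": "budget bluetooth headphones",
    "recommendations": [
        {"explanation": "...", "confidence": {"hhem_score": 0.78, "is_grounded": True},
         "citations_verified": True},
        {"explanation": "...", "confidence": {"hhem_score": 0.31, "is_grounded": False},
         "citations_verified": False},
    ],
}
print(len(grounded_only(response)))  # 1
```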

### POST /recommend/stream

Server-sent events for token-by-token explanation streaming.
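Server-sent events arrive as `data:` lines separated by blank lines. A minimal parser for such a stream, fed an iterable of decoded lines (the sample payloads are illustrative, not Sage's exact event schema):

```python
def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of
    decoded lines; multi-line data fields are joined with newlines."""
    buf = []
    for line in lines:
        if line.startswith("data:"):
            buf.append(line[5:].lstrip())
        elif line == "" and buf:
            yield "\n".join(buf)  # blank line terminates the event
            buf = []
    if buf:
        yield "\n".join(buf)

stream = ["data: The", "", "data: se earbuds", "", "data: [DONE]", ""]
print(list(iter_sse_data(stream)))  # ['The', 'se earbuds', '[DONE]']
```

With `requests`, the same generator can wrap `response.iter_lines(decode_unicode=True)` on a streaming POST.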

### GET /health, /metrics, /cache/stats

Health check, Prometheus metrics, and cache statistics.


## Evaluation

```bash
make eval          # ~5 min: standard pre-commit
make eval-full     # ~17 min: complete suite + load test
```

## Project Structure (Key Directories)

```text
sage/
β”œβ”€β”€ adapters/       # External integrations (Qdrant, LLM, HHEM)
β”œβ”€β”€ api/            # FastAPI routes, middleware, Prometheus metrics
β”œβ”€β”€ config/         # Settings, logging, query templates
β”œβ”€β”€ core/           # Domain models, aggregation, verification, chunking
β”œβ”€β”€ services/       # Business logic (retrieval, explanation, cache)
scripts/
β”œβ”€β”€ pipeline.py     # Data ingestion and embedding
β”œβ”€β”€ evaluation.py   # NDCG, precision, recall, novelty, baselines
β”œβ”€β”€ faithfulness.py # HHEM, RAGAS, grounding delta
β”œβ”€β”€ human_eval.py   # Interactive human evaluation
β”œβ”€β”€ load_test.py    # P99 latency benchmarking
```

## License

Academic/portfolio use only. Uses the Amazon Reviews 2023 dataset.