Spaces:

gopikrishnait
/

CapStoneRAG10

Sleeping

App Files Files Community

CapStoneRAG10 / docs /METRICS_BEFORE_AFTER.md

Developer

Initial commit for HuggingFace Spaces - RAG Capstone Project with Qdrant Cloud

1d10b0a about 2 months ago

preview code

raw

history blame contribute delete

4.67 kB

TRACe Metrics - Before & After Fixes

Issue #1: Evaluation Logs Appearing Multiple Times

Before ❌

📋 **Evaluation Logs:**
📋 **Evaluation Logs:**
📋 **Evaluation Logs:**  ← Repeated header
⏱️ Evaluation started...
📋 **Evaluation Logs:**
📋 **Evaluation Logs:**
📊 Dataset: hotpotqa

After ✅

📋 Evaluation Logs:      ← Header appears once
⏱️ Evaluation started...
📊 Dataset: hotpotqa
📈 Total samples: 10
🤖 LLM Model: llama-3.1-8b-instant
...

Issue #2: Adherence Metric (Decimal vs Boolean)

Before ❌

Adherence Metric Values:
- Query 1: 0.67  (decimal, not Boolean)
- Query 2: 0.58  (decimal, unclear if grounded)
- Query 3: 0.89  (decimal, hard to interpret)
- Query 4: 0.43  (decimal, is this grounded or not?)

📊 Results:
Adherence: 0.644 (average)  ← Decimal, not Boolean

Problem: Hard to determine if response is grounded or hallucinated.

After ✅

Adherence Metric Values (Boolean):
- Query 1: 1.0  ✅ Fully grounded (>50% of words in docs)
- Query 2: 0.0  ❌ Contains hallucinations (<50% grounding)
- Query 3: 1.0  ✅ Fully grounded
- Query 4: 0.0  ❌ Contains hallucinations

📊 Results:
Adherence: 0.5  (50% of responses grounded)

Benefits:

Clear: 1.0 = trust this response, 0.0 = don't trust it
Binary decision: grounded vs hallucinated
Aligns with RAGBench paper definition

Issue #3: Completeness Always Returning 1.0

Before ❌

Completeness Metric Values:
- Query 1: 1.0  (response has date keyword → score 1.0)
- Query 2: 1.0  (response has location keyword → score 1.0)
- Query 3: 1.0  (response has person name → score 1.0)
- Query 4: 1.0  (response has period keyword → score 1.0)
- Query 5: 1.0  (always 1.0)
- Query 10: 1.0 (always 1.0)

📊 Results:
Completeness: 1.0  (always!)  ← No variation, not informative

Problem: Metric is not discriminative; always returns 1.0

After ✅

Completeness Metric Values:
- Query 1 (When): 0.72  (Ground truth coverage: 40% + length: 1.0 = 0.3*1.0 + 0.7*0.40 = 0.58 avg)
- Query 2 (Where): 0.45  (Ground truth coverage: 15% + others = 0.45)
- Query 3 (Who): 0.88   (Ground truth coverage: 90% + length: 1.0 = 0.3*1.0 + 0.7*0.90 = 0.93 avg)
- Query 4 (What): 0.31  (Ground truth coverage: 10% → low completeness)
- Query 5 (Why): 0.55   (No ground truth, has keywords → 0.7)
- Query 10 (How): 0.62  (Ground truth coverage: 55%)

📊 Results:
Completeness: 0.59  (varies by response quality)  ✅ Informative!

Formula Used:

With ground truth: 0.3 * (length_score) + 0.7 * (overlap_ratio)
Without ground truth: 0.3 (default) or 0.7 (if has answer keywords)

Interpretation:

0.1–0.3 = Poor coverage of relevant info
0.4–0.6 = Moderate coverage
0.7–1.0 = Good coverage of relevant information

Comprehensive Before/After Comparison

Test Case: Query: "When was World War II?"

Before (Broken Metrics) ❌

Retrieved Documents:
  - Doc1: "World War II lasted from 1939 to 1945"
  - Doc2: "About 70 million people died in WW2"
  - Doc3: "The war involved many countries"

Response: "World War II started in 1939 and ended in 1945."

Metrics:
  ├─ Utilization: 0.75  (decimal, somewhat confusing)
  ├─ Relevance: 0.82    (decimal, okay)
  ├─ Adherence: 0.85    ❌ WRONG: Should be Boolean (1.0)
  ├─ Completeness: 1.0  ❌ WRONG: Always 1.0, not informative
  └─ Average: 0.86

After (Fixed Metrics) ✅

Retrieved Documents:
  - Doc1: "World War II lasted from 1939 to 1945"
  - Doc2: "About 70 million people died in WW2"
  - Doc3: "The war involved many countries"

Response: "World War II started in 1939 and ended in 1945."

Ground Truth: "World War II occurred from 1939-1945."

Metrics:
  ├─ Utilization: 0.75  (uses 2/3 docs with good depth)
  ├─ Relevance: 0.82    (retrieved docs are relevant to query)
  ├─ Adherence: 1.0     ✅ CORRECT: Response fully grounded in docs
  ├─ Completeness: 0.85 ✅ CORRECT: Response covers 85% of ground truth info
  └─ Average: 0.85      (reliable score)

Summary of Fixes

Metric	Issue	Before	After	Benefit
Logs	Duplicated	Multiple headers	Single header	Cleaner UI
Adherence	Wrong type	Decimal (0.67)	Boolean (1.0/0.0)	Clear grounding assessment
Completeness	Always max	Always 1.0	Varies (0.3–1.0)	Discriminative scoring

All metrics now align with the RAGBench paper definitions and provide meaningful, actionable insights into RAG system performance. ✅