
Adherence Metric Fix - Summary

Problem

The adherence metric was returning decimal values (e.g., 0.333, 0.667, 0.8) instead of boolean values (0.0 or 1.0) as defined in the RAGBench paper.

Root Cause

The _compute_adherence() method in advanced_rag_evaluator.py was computing adherence as:

# WRONG: returns the fraction of supported sentences
return fully_supported_count / total_sentences  # e.g., 2/3 = 0.667

This treats adherence as a "proportion of supported sentences" metric, which is incorrect.

Solution

Updated the method to compute adherence as a boolean, per the RAGBench definition:

# CORRECT: Returns 1.0 or 0.0
return 1.0 if fully_supported_count == total_sentences else 0.0

RAGBench Definition of Adherence

Per the RAGBench paper, adherence is a boolean metric that indicates whether the response is fully grounded in the context:

  • 1.0: Fully grounded - ALL sentences in the response are fully supported by the retrieved documents
  • 0.0: Contains hallucination - ANY sentence in the response is not fully supported (hallucinated)
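Under this definition, adherence reduces to an all-or-nothing check over per-sentence support labels. A minimal sketch (the function name, argument, and empty-input handling here are illustrative assumptions, not the actual `_compute_adherence()` signature):

```python
def compute_adherence(sentence_supported: list[bool]) -> float:
    """Return 1.0 only if EVERY response sentence is fully supported
    by the retrieved context; otherwise 0.0 (RAGBench definition)."""
    if not sentence_supported:
        # Assumption: an empty response is treated as not grounded.
        return 0.0
    return 1.0 if all(sentence_supported) else 0.0


print(compute_adherence([True, True, True]))   # 1.0
print(compute_adherence([True, True, False]))  # 0.0
```

Note that a single unsupported sentence drives the whole metric to 0.0; there is no partial credit.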

Examples

Scenario 1: All sentences fully supported

  • Sentence A: Supported ✓
  • Sentence B: Supported ✓
  • Sentence C: Supported ✓
  • Adherence = 1.0 (fully grounded)

Scenario 2: One sentence not fully supported (hallucination)

  • Sentence A: Supported ✓
  • Sentence B: Supported ✓
  • Sentence C: NOT Supported ✗ (hallucinated)
  • Adherence = 0.0 (contains hallucination)

Scenario 3: No sentences fully supported

  • Sentence A: NOT Supported ✗
  • Sentence B: NOT Supported ✗
  • Sentence C: NOT Supported ✗
  • Adherence = 0.0 (completely hallucinated)
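The three scenarios above can be checked directly. This sketch contrasts the old fractional computation with the corrected boolean one (function names are illustrative, not the actual method names):

```python
def adherence_old(flags: list[bool]) -> float:
    # Buggy behavior: fraction of supported sentences.
    return sum(flags) / len(flags)


def adherence_fixed(flags: list[bool]) -> float:
    # RAGBench behavior: boolean, all-or-nothing.
    return 1.0 if all(flags) else 0.0


scenarios = {
    "all supported":    [True, True, True],    # old 1.0   -> fixed 1.0
    "one hallucinated": [True, True, False],   # old 0.667 -> fixed 0.0
    "none supported":   [False, False, False], # old 0.0   -> fixed 0.0
}
for name, flags in scenarios.items():
    print(f"{name}: old={adherence_old(flags):.3f}, fixed={adherence_fixed(flags)}")
```

Only Scenario 2 changes under the fix, which is exactly the case where the decimal values (0.333, 0.667, ...) were being reported.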

File Changed

  • advanced_rag_evaluator.py (Lines 600-617)
    • Updated _compute_adherence() method

Impact

  • Adherence metric now returns only 0.0 or 1.0
  • Aligns with RAGBench paper specification
  • Better represents "grounded vs hallucinated" classification
  • More intuitive interpretation: 0 = has hallucinations, 1 = fully grounded

Testing

Verified with multiple scenarios:

  • All supported → 1.0 ✓
  • Partial support → 0.0 ✓
  • No support → 0.0 ✓

Related Metrics

For reference, the other metrics computed by the GPT labeling method:

  • Context Relevance: Fraction (0-1) - proportion of relevant sentences
  • Context Utilization: Fraction (0-1) - proportion of relevant sentences used
  • Completeness: Fraction (0-1) - proportion of answer info covered
  • Adherence: Boolean (0.0 or 1.0) - whether response is fully grounded ← FIXED
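Since adherence is the only boolean among these metrics, a lightweight range check can catch regressions like this one. A hedged sketch (the metric keys mirror the list above; the `validate_metrics` helper is hypothetical and not part of `advanced_rag_evaluator.py`):

```python
FRACTION_METRICS = {"context_relevance", "context_utilization", "completeness"}


def validate_metrics(metrics: dict[str, float]) -> None:
    """Raise AssertionError if a metric falls outside its expected range."""
    for name, value in metrics.items():
        if name == "adherence":
            # Boolean metric: only 0.0 or 1.0 are valid.
            assert value in (0.0, 1.0), f"adherence must be 0.0 or 1.0, got {value}"
        elif name in FRACTION_METRICS:
            # Fraction metrics: any value in [0, 1] is valid.
            assert 0.0 <= value <= 1.0, f"{name} must be in [0, 1], got {value}"


validate_metrics({
    "context_relevance": 0.8,
    "context_utilization": 0.6,
    "completeness": 0.75,
    "adherence": 1.0,
})
```

With this check in place, a value like `adherence=0.667` fails immediately instead of silently passing as a fraction.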