Adherence Metric Fix - Summary
Problem
The adherence metric was returning decimal values (e.g., 0.333, 0.667, 0.8) instead of boolean values (0.0 or 1.0) as defined in the RAGBench paper.
Root Cause
The _compute_adherence() method in advanced_rag_evaluator.py was computing adherence as:
```python
# WRONG: returns a fraction
return fully_supported / total_sentences  # e.g., 2/3 = 0.667
```
This treats adherence as a "proportion of supported sentences" metric, which is incorrect.
Solution
Updated the method to compute adherence as a boolean according to RAGBench definition:
```python
# CORRECT: returns 1.0 or 0.0
return 1.0 if fully_supported_count == total_sentences else 0.0
```
RAGBench Definition of Adherence
Per the RAGBench paper, adherence is a boolean metric that indicates whether the response is fully grounded in the context:
- 1.0: Fully grounded - ALL sentences in the response are fully supported by the retrieved documents
- 0.0: Contains hallucination - ANY sentence in the response is not fully supported (hallucinated)
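The all-or-nothing definition above can be sketched as a small standalone function. This is an illustrative sketch, not the evaluator's actual code: the name `compute_adherence` and the list-of-booleans input are assumptions, and treating an empty response as non-adherent is a choice made here for the sketch.

```python
def compute_adherence(sentence_support: list[bool]) -> float:
    """Boolean adherence per the RAGBench definition.

    `sentence_support[i]` is True iff response sentence i is fully
    supported by the retrieved documents.
    """
    if not sentence_support:
        return 0.0  # assumption: an empty response counts as not grounded
    # 1.0 only if EVERY sentence is supported; any hallucination -> 0.0
    return 1.0 if all(sentence_support) else 0.0
```

Using `all()` makes the intent explicit: a single unsupported sentence flips the metric to 0.0, rather than merely lowering a fraction.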
Examples
Scenario 1: All sentences fully supported
- Sentence A: Supported ✓
- Sentence B: Supported ✓
- Sentence C: Supported ✓
- Adherence = 1.0 (fully grounded)
Scenario 2: One sentence not fully supported (hallucination)
- Sentence A: Supported ✓
- Sentence B: Supported ✓
- Sentence C: NOT Supported ✗ (hallucinated)
- Adherence = 0.0 (contains hallucination)
Scenario 3: No sentences fully supported
- Sentence A: NOT Supported ✗
- Sentence B: NOT Supported ✗
- Sentence C: NOT Supported ✗
- Adherence = 0.0 (completely hallucinated)
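Scenario 2 is the case where the old and new behavior diverge most clearly. A minimal illustration (not the evaluator's actual code, just inline expressions over a list of support flags):

```python
# Scenario 2: sentences A and B supported, sentence C hallucinated
supported = [True, True, False]

# Old (buggy) behavior: proportion of supported sentences
old_adherence = sum(supported) / len(supported)   # 2/3 = 0.666...

# New (fixed) behavior: boolean, all-or-nothing
new_adherence = 1.0 if all(supported) else 0.0    # 0.0

print(round(old_adherence, 3), new_adherence)  # 0.667 0.0
```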
File Changed
- advanced_rag_evaluator.py (Lines 600-617): updated the `_compute_adherence()` method
Impact
- Adherence metric now returns only 0.0 or 1.0
- Aligns with RAGBench paper specification
- Better represents "grounded vs hallucinated" classification
- More intuitive interpretation: 0 = has hallucinations, 1 = fully grounded
Testing
Verified with multiple scenarios:
- All supported → 1.0 ✓
- Partial support → 0.0 ✓
- No support → 0.0 ✓
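The three checks above can be expressed as plain assertions. This sketch assumes a callable equivalent to the fixed `_compute_adherence()` that takes per-sentence support flags; the name `adherence` is hypothetical.

```python
def adherence(supported):
    # All-or-nothing, per the fixed method
    return 1.0 if supported and all(supported) else 0.0

assert adherence([True, True, True]) == 1.0    # all supported
assert adherence([True, True, False]) == 0.0   # partial support
assert adherence([False, False, False]) == 0.0 # no support
```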
Related Metrics
For reference, the other metrics in the GPT Labeling method:
- Context Relevance: Fraction (0-1) - proportion of relevant sentences
- Context Utilization: Fraction (0-1) - proportion of relevant sentences used
- Completeness: Fraction (0-1) - proportion of answer info covered
- Adherence: Boolean (0.0 or 1.0) - whether response is fully grounded ✓ FIXED
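The contrast between the fraction-style metrics and boolean adherence can be shown side by side. The variable names and judgment flags below are illustrative assumptions, not the evaluator's API:

```python
# Hypothetical per-sentence judgments
relevant_ctx = [True, True, False, True]  # context sentences judged relevant
supported_resp = [True, True, True]       # response sentences judged supported

# Fraction-style metric: stays a proportion in [0, 1]
context_relevance = sum(relevant_ctx) / len(relevant_ctx)  # 0.75

# Boolean metric: collapses to exactly 0.0 or 1.0
adherence = 1.0 if all(supported_resp) else 0.0            # 1.0
```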