# TRACe Metrics Fixes - Summary

## Issues Fixed
### 1. Evaluation Logs Appearing Multiple Times ✅

**Problem:**
The "📋 Evaluation Logs:" header was appearing multiple times in the UI because `st.write()` was being called inside the `add_log()` function, which gets called repeatedly as logs are added.

**Root Cause:**
```python
def add_log(message: str):
    logs_list.append(message)
    with logs_container:
        st.write("📋 **Evaluation Logs:**")  # ❌ Called every time add_log() is called
        for log_msg in logs_list:
            st.caption(log_msg)
```
**Solution:**
Move the header rendering outside the function and use a placeholder whose contents are replaced on each update, so the header is only ever rendered in one place:
```python
# Create the placeholder once, outside the function
logs_placeholder = st.empty()

def add_log(message: str):
    logs_list.append(message)
    with logs_placeholder.container():
        st.markdown("### 📋 Evaluation Logs:")  # ✅ Replaces the placeholder's contents in place
        for log_msg in logs_list:
            st.caption(log_msg)
```
**Result:** The header now appears only once at the top of the logs. ✅
### 2. Adherence Metric Returning Decimal Instead of Boolean ✅

**Problem:**
Adherence was returning a continuous value (0.0-1.0) instead of a Boolean (0.0 = hallucinated, 1.0 = grounded).
**Root Cause:**
The metric was computing the average grounding ratio across all sentences:
```python
adherence_scores = []
for sentence in response_sentences:
    grounding_ratio = grounded_words / len(sentence_words)
    adherence_scores.append(grounding_ratio)
return np.mean(adherence_scores)  # ❌ Returns a fraction such as 0.1-0.9, not a Boolean
```
**Paper Definition:**
Per the RAGBench paper: "Adherence is a Boolean indicating whether ALL response sentences are supported by context."
**Solution:**
Implement Boolean logic: if ANY sentence has low grounding, the entire response is marked as hallucinated:
```python
def _compute_adherence(...) -> float:
    grounding_threshold = 0.5  # At least 50% of a sentence's words must appear in the docs
    all_grounded = True
    for sentence in response_sentences:
        grounding_ratio = grounded_words / len(sentence_words)
        if grounding_ratio < grounding_threshold:
            all_grounded = False
            break
    # Return Boolean: 1.0 (grounded) or 0.0 (hallucinated)
    return 1.0 if all_grounded else 0.0
```
**Result:**
- Adherence now returns only `1.0` (fully grounded) or `0.0` (contains hallucinations) ✅
- Interpretation: a response is either fully trusted or marked as untrustworthy
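The snippet above is abridged (it omits how `grounded_words` is computed). A minimal self-contained sketch of the same Boolean logic, with illustrative tokenization and a hypothetical `compute_adherence` name rather than the project's exact code:

```python
import re

def compute_adherence(response: str, context: str, threshold: float = 0.5) -> float:
    """Return 1.0 if every response sentence is sufficiently grounded in the context, else 0.0."""
    context_words = set(re.findall(r"\w+", context.lower()))
    sentences = [s for s in re.split(r"[.!?]+", response) if s.strip()]
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        grounded = sum(1 for w in words if w in context_words)
        if grounded / len(words) < threshold:
            return 0.0  # one weakly grounded sentence marks the whole response as hallucinated
    return 1.0
```

A single fabricated sentence flips the whole score to 0.0, which is exactly the all-or-nothing behavior the paper definition calls for.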
### 3. Completeness Always Returning 1.0 ✅

**Problem:**
The completeness metric was consistently returning 1.0 because the logic always appended a 1.0 whenever the response type matched the query type.
**Root Cause:**
```python
# Check for appropriate response type
if is_when and any(w in response_lower for w in ["year", "date", "time", "century"]):
    completeness_factors.append(1.0)  # ❌ Always 1.0 if a keyword is found
elif is_where and any(w in response_lower for w in ["location", "place", "country", "city"]):
    completeness_factors.append(1.0)  # ❌ Always 1.0
# ... more conditions, all appending 1.0
```
**Paper Definition:**
Per RAGBench: "Completeness = Len(R_i ∩ U_i) / Len(R_i)" — how much of the RELEVANT information is covered.
**Solution:**
Implement proper weighted scoring:
```python
completeness_scores = []

# Score 1: Response has substantive content (30% weight)
min_content_words = 10
length_score = min(len(response_words) / min_content_words, 1.0)
completeness_scores.append(length_score * 0.3)

# Score 2: Ground-truth coverage (70% weight), IF available
if ground_truth:
    gt_words = set(self._tokenize(gt_lower))
    overlap = len(gt_words & response_words)
    gt_coverage = overlap / len(gt_words)  # ✅ Actual overlap ratio
    completeness_scores.append(gt_coverage * 0.7)
else:
    # Without ground truth: heuristic check for answer-type keywords
    base_score = 0.3
    if found_relevant_keywords:
        base_score = 0.7
    completeness_scores.append(base_score)

# Sum the weighted terms (each already carries its weight, so summing --
# not averaging -- keeps the result on the intended 0.0-1.0 scale)
return np.sum(completeness_scores)  # ✅ Returns roughly 0.3-1.0, not always 1.0
```
**Result:**
- Completeness now varies (roughly 0.3-1.0) based on actual ground-truth coverage ✅
- Without ground truth, it uses heuristic keyword matching (0.3 = unlikely complete, 0.7 = likely complete)
- Values are now realistic and informative
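A runnable sketch of the weighted scoring described above, under stated assumptions: `compute_completeness` and its regex tokenizer are illustrative stand-ins for the project's method, and the keyword-heuristic branch is reduced to its 0.3 floor for brevity:

```python
import re
import numpy as np

def compute_completeness(response, ground_truth=None):
    """Weighted completeness score in [0, 1]; names and tokenization are illustrative."""
    tokenize = lambda text: set(re.findall(r"\w+", text.lower()))
    response_words = tokenize(response)
    scores = []
    # Score 1: substantive content (30% weight)
    min_content_words = 10
    scores.append(min(len(response_words) / min_content_words, 1.0) * 0.3)
    # Score 2: ground-truth coverage (70% weight), when available
    if ground_truth:
        gt_words = tokenize(ground_truth)
        coverage = len(gt_words & response_words) / max(len(gt_words), 1)
        scores.append(coverage * 0.7)
    else:
        scores.append(0.3)  # heuristic floor when no ground truth exists
    return float(np.sum(scores))  # weighted sum, not a plain mean
```

A response that fully covers the ground truth scores near 1.0, while a short answer with no ground truth bottoms out near 0.3, matching the ranges claimed above.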
## Metric Outputs After Fixes

**Example Results:**

| Metric | Value | Interpretation |
|---|---|---|
| Utilization | 0.72 | 72% of retrieved docs were used by the generator |
| Relevance | 0.85 | 85% of retrieved docs are relevant to the query |
| Adherence | 1.0 | Response is fully grounded ✅ (no hallucinations) |
| Adherence | 0.0 | Response contains unsupported claims ❌ (hallucinated) |
| Completeness | 0.45 | Response covers ~45% of the relevant information |
| Average | 0.75 | Overall RAG system quality score |
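The Average row is the plain mean of the four metric values. A quick sketch using the example figures from the table (the dictionary keys are just illustrative labels):

```python
import numpy as np

# Example per-metric values from the table above
metrics = {"utilization": 0.72, "relevance": 0.85, "adherence": 1.0, "completeness": 0.45}
overall = float(np.mean(list(metrics.values())))  # ~0.755, reported rounded as 0.75
```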
## Files Modified

- `streamlit_app.py` (line ~665) - Fixed: evaluation logs header repetition
- `trace_evaluator.py` (two methods):
  - Fixed: `_compute_adherence()` now returns a Boolean (1.0 or 0.0)
  - Fixed: `_compute_completeness()` now computes a proper weighted coverage score
## Testing the Fixes

To verify the fixes are working:

1. Run an evaluation on a dataset
2. Check the UI:
   - The logs header should appear only once
   - Adherence shows only `1.0` or `0.0`
   - Completeness shows varied values (e.g., 0.3-0.9)
3. Check the results:
   - Values should be realistic and differ across test cases
   - All 4 metrics should now align with the RAGBench paper definitions
## Related Documentation

- RAGBench Paper: arXiv:2407.11005v2
- Alignment Guide: `TRACE_EVALUATION_ALIGNMENT.md`
- Evaluation Guide: `EVALUATION_GUIDE.md` (updated)