TRACe Evaluation Framework - Alignment with RAGBench Paper
Summary of Changes
This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the TRACe framework as defined in the RAGBench paper (arXiv:2407.11005).
Key Clarifications
The TRACe Framework Has 4 Metrics, NOT 5
❌ Incorrect: T, R, A, C, E (with "E = Evaluation" as a separate metric)
✅ Correct: T, R, A, C (as defined in the RAGBench paper)
The stylization "TRACe" is just how the acronym is capitalized; there is no 5th "E" metric.
The 4 TRACe Metrics (Per RAGBench Paper)
1. T – uTilization (Context Utilization)
Definition:
The fraction of retrieved context that the generator actually uses to produce the response.
Formula:

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)
Interpretation:
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
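As a minimal, hypothetical sketch (not the project's actual implementation), token-level Utilization can be computed from the definitions above, assuming utilized spans $U_i$ are available as sets of token indices per retrieved document:

```python
# Hypothetical helper, assuming span annotations as token-index sets.
def utilization(u_spans: list[set[int]], doc_lens: list[int]) -> float:
    """Fraction of retrieved context the generator actually used."""
    total = sum(doc_lens)
    used = sum(len(u) for u in u_spans)
    return used / total if total else 0.0

# Two 10-token docs; the response draws on 3 tokens of the first
# and none of the second, so utilization is 3 / 20 = 0.15.
print(utilization([{0, 1, 2}, set()], [10, 10]))  # 0.15
```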
2. R – Relevance (Context Relevance)
Definition:
The fraction of retrieved context that is actually relevant to answering the query.
Formula:

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$
Interpretation:
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
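Relevance has the same length-ratio form, just over relevant spans $R_i$ instead of utilized spans. A hedged sketch (`span_fraction` is a hypothetical helper, not code from trace_evaluator.py) shows how comparing the two ratios diagnoses the "good docs, unused" failure mode:

```python
# Hypothetical helper: total annotated-span length over total context length.
def span_fraction(spans: list[set[int]], doc_lens: list[int]) -> float:
    total = sum(doc_lens)
    return sum(len(s) for s in spans) / total if total else 0.0

# One 10-token doc: 8 tokens are relevant, but the response used only 2.
rel = span_fraction([{0, 1, 2, 3, 4, 5, 6, 7}], [10])  # relevance = 0.8
use = span_fraction([{0, 1}], [10])                    # utilization = 0.2
# High relevance + low utilization -> the generator underuses good context.
print(rel, use)
```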
3. A – Adherence (Faithfulness / Groundedness / Attribution)
Definition:
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.
Paper Definition:
- Example-level: Boolean – True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded
Interpretation:
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
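The example-level boolean reduces to an `all()` over per-sentence support labels. A minimal sketch, assuming those labels come from some upstream judge or annotation (the `adherence` helper is hypothetical, not project code):

```python
# Hypothetical helper over per-sentence support labels.
def adherence(sentence_supported: list[bool]) -> bool:
    """True only if every response sentence is grounded in the context."""
    return all(sentence_supported)

print(adherence([True, True, True]))   # True  -> fully grounded
print(adherence([True, False, True]))  # False -> contains an unsupported claim
```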
4. C – Completeness
Definition:
How much of the relevant information in the context is actually covered/incorporated by the response.
Formula:

$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents
Interpretation:
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
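Completeness is the utilized share of the relevant spans, per the $R_i \cap U_i$ intersection above. A hedged token-level sketch (the `completeness` helper is hypothetical, not code from trace_evaluator.py):

```python
# Hypothetical helper: fraction of relevant spans the response covered.
def completeness(r_spans: list[set[int]], u_spans: list[set[int]]) -> float:
    relevant = sum(len(r) for r in r_spans)
    covered = sum(len(r & u) for r, u in zip(r_spans, u_spans))
    return covered / relevant if relevant else 0.0

# Relevant tokens {0..4}; the response used {2..6}: 3 of 5 covered.
print(completeness([{0, 1, 2, 3, 4}], [{2, 3, 4, 5, 6}]))  # 0.6
```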
Code Changes Made
1. EVALUATION_GUIDE.md
- ✅ Updated header to reference the RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems
2. trace_evaluator.py
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: which chunking strategy was used
  - `embedding_model`: which embedding model was used
  - `chunk_size`: chunk size configuration
  - `chunk_overlap`: chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include the evaluation config in the results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper typing)
3. vector_store.py (ChromaDBManager)
- ✅ Added instance attributes to track evaluation-related metadata: `self.chunking_strategy`, `self.chunk_size`, `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test
4. streamlit_app.py
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, and chunk overlap
  - Logs the embedding model used
  - Passes this metadata to `TRACEEvaluator` for tracking
- ✅ Added new log entries in the evaluation flow:

  🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
Benefits of These Changes
- Alignment with Paper: Metrics now follow RAGBench paper definitions exactly
- Reproducibility: Evaluation config (chunking, embedding) is stored and logged with results
- Consistency: Same chunking/embedding used for all test questions per evaluation
- Clarity: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
- Traceability: Results can be audited to understand what retrieval config was used
Usage Example
```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50,
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include the evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
Future Improvements
- Implement span-level annotation following RAGBench approach for ground truth metrics
- Add fine-tuned evaluator models (e.g., DeBERTa) for more accurate metric computation
- Store evaluation results with full metadata in persistent storage for historical tracking
- Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores
References
- RAGBench Paper: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
- arXiv: 2407.11005v2
- Dataset: https://huggingface.co/datasets/rungalileo/ragbench
- GitHub: https://github.com/rungalileo/ragbench