# TRACe Evaluation Framework - Alignment with RAGBench Paper

## Summary of Changes

This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).

---

## Key Clarifications

### The TRACe Framework is **4 metrics**, NOT 5

❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)

✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)

The stylization "TRACe" is just how the acronym is capitalized; there is no 5th "E" metric.

---

## The 4 TRACe Metrics (Per RAGBench Paper)

### 1. **T — uTilization (Context Utilization)**

**Definition:** The fraction of retrieved context that the generator actually uses to produce the response.

**Formula:**

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)

**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context

---

### 2. **R — Relevance (Context Relevance)**

**Definition:** The fraction of retrieved context that is actually relevant to answering the query.

**Formula:**

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$

**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them

---

### 3. **A — Adherence (Faithfulness / Groundedness / Attribution)**

**Definition:** Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.

**Paper Definition:**
- Example-level: **Boolean** — True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded

**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response

---

### 4. **C — Completeness**

**Definition:** How much of the relevant information in the context is actually covered/incorporated by the response.

**Formula:**

$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents

**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness

---

## Code Changes Made

### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems

### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper type)

### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test

### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, chunk overlap
  - Logs embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in evaluation flow:

```
🔧 Retrieval Configuration:
   • Chunking Strategy:
   • Chunk Size:
   • Chunk Overlap:
   • Embedding Model:
```

---

## Benefits of These Changes

1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used

---

## Usage Example

```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include evaluation config
print(results["evaluation_config"])
# Output: {
#   "chunking_strategy": "dense",
#   "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#   "chunk_size": 512,
#   "chunk_overlap": 50
# }
```

---

## Future Improvements

1. Implement **span-level annotation** following the RAGBench approach for ground-truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores

---

## References

- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems" (arXiv:2407.11005v2)
- Dataset: https://huggingface.co/datasets/rungalileo/ragbench
- GitHub: https://github.com/rungalileo/ragbench
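---

## Appendix: Sketch of the TRACe Formulas in Code

To make the Utilization, Relevance, and Completeness formulas concrete, here is a minimal, self-contained sketch of how they can be computed from token-level span annotations. This is illustrative only: the `trace_metrics` helper and its input shape are assumptions for this example, not part of `trace_evaluator.py`, and a real implementation would first need annotated (or model-predicted) relevant/utilized spans as in the RAGBench paper.

```python
def trace_metrics(docs):
    """Compute example-level TRACe ratios from span-annotated documents.

    Each doc is a dict of token-level annotations:
      'len'      - Len(d_i): total tokens in the document
      'relevant' - R_i: set of relevant token indices
      'utilized' - U_i: set of utilized token indices
    (Adherence is a separate example-level boolean in the paper and
    is not derivable from these sets, so it is omitted here.)
    """
    total_len = sum(d["len"] for d in docs)
    utilized = sum(len(d["utilized"]) for d in docs)
    relevant = sum(len(d["relevant"]) for d in docs)
    # Completeness numerator: relevant tokens that were also utilized
    rel_and_used = sum(len(d["relevant"] & d["utilized"]) for d in docs)
    return {
        "utilization": utilized / total_len if total_len else 0.0,
        "relevance": relevant / total_len if total_len else 0.0,
        "completeness": rel_and_used / relevant if relevant else 0.0,
    }


# Example: one 10-token document; tokens 0-5 are relevant, tokens 2-7
# were actually used by the generator.
docs = [{
    "len": 10,
    "relevant": set(range(0, 6)),
    "utilized": set(range(2, 8)),
}]
metrics = trace_metrics(docs)
print(metrics)  # utilization 0.6, relevance 0.6, completeness 4/6 ≈ 0.667
```

Note how the example exposes the diagnostic reading described above: utilization and relevance are both moderate, and completeness < 1.0 signals that the generator skipped some relevant tokens (indices 0 and 1).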