# TRACe Evaluation Framework - Alignment with RAGBench Paper

## Summary of Changes

This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).

---

## Key Clarifications

### The TRACe Framework is **4 metrics**, NOT 5

❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)

✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)

The lowercase "e" in "TRACe" is just how the acronym is stylized; there is no 5th "E" metric.

---
## The 4 TRACe Metrics (Per RAGBench Paper)

### 1. **T – uTilization (Context Utilization)**

**Definition:**
The fraction of the retrieved context that the generator actually uses to produce the response.

**Formula:**

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (at the sentence, token, or character level)

**Interpretation:**
- Low Utilization + Low Relevance → greedy retriever returning irrelevant documents
- Low Utilization alone → weak generator that fails to leverage good context
- High Utilization → generator efficiently uses the provided context
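As a minimal sketch of the formula above (assuming character-level Len; the `docs` and `utilized_spans` names are illustrative, not from the project's code):

```python
# Minimal sketch: character-level Utilization over retrieved documents.
# `docs` maps a document id to its full text; `utilized_spans` maps the same
# id to the substrings of that document the generator actually drew on.

def utilization(docs: dict[str, str], utilized_spans: dict[str, list[str]]) -> float:
    used = sum(len(span) for spans in utilized_spans.values() for span in spans)
    total = sum(len(text) for text in docs.values())
    return used / total if total else 0.0

docs = {"d1": "Paris is the capital of France. It has 2.1M residents."}
spans = {"d1": ["Paris is the capital of France."]}
print(utilization(docs, spans))  # fraction of d1's characters the generator used
```

The same denominator (total retrieved length) appears in the Relevance formula below, which makes the two metrics directly comparable.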
---

### 2. **R – Relevance (Context Relevance)**

**Definition:**
The fraction of the retrieved context that is actually relevant to answering the query.

**Formula:**

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$

**Interpretation:**
- High Relevance → retriever returned mostly relevant documents
- Low Relevance → retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → good documents retrieved, but the generator doesn't use them

---
### 3. **A – Adherence (Faithfulness / Groundedness / Attribution)**

**Definition:**
Whether the response is grounded in and fully supported by the retrieved context. This metric detects hallucinations.

**Paper Definition:**
- Example-level: **Boolean** → True if all response sentences are supported; False if any part is unsupported
- Span/sentence-level: can annotate which specific response sentences are grounded

**Interpretation:**
- High Adherence (1.0) → response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → response contains unsupported claims ❌
- Mid Adherence → partially grounded response
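A toy sketch of the example-level boolean: every response sentence must pass a support check. The per-sentence check here is a deliberately naive lexical-overlap heuristic; RAGBench-style evaluation would use a trained entailment/attribution model instead.

```python
# Example-level (boolean) Adherence: True only if every response sentence is
# supported by the context. The support check is a toy heuristic, not the
# project's actual method: a sentence counts as supported if all of its
# tokens appear somewhere in the context.
import re

def sentence_supported(sentence: str, context: str) -> bool:
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ctx = set(re.findall(r"[a-z0-9]+", context.lower()))
    return words <= ctx  # every token of the sentence appears in the context

def adherence(response: str, context: str) -> bool:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    return all(sentence_supported(s, context) for s in sentences)

ctx = "The Eiffel Tower is in Paris. It opened in 1889."
print(adherence("The Eiffel Tower opened in 1889.", ctx))  # → True (grounded)
print(adherence("The Eiffel Tower opened in 1887.", ctx))  # → False (wrong year)
```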
---

### 4. **C – Completeness**

**Definition:**
How much of the relevant information in the context is actually covered/incorporated by the response.

**Formula:**

$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example level by aggregating across documents

**Interpretation:**
- High Completeness → generator covers all relevant information
- Low Completeness + High Utilization → generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
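The intersection in the formula can be made concrete with character-offset spans. A minimal sketch, assuming spans are `(start, end)` offsets within one document and that the utilized spans do not overlap each other:

```python
# Minimal sketch of Completeness using character-offset spans (start, end):
# overlap of relevant and utilized spans, divided by total relevant length.
# Span representation is illustrative; the project may annotate at sentence
# or token level instead.

def span_overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def completeness(relevant: list[tuple[int, int]],
                 utilized: list[tuple[int, int]]) -> float:
    total_relevant = sum(end - start for start, end in relevant)
    if total_relevant == 0:
        return 0.0
    # Assumes utilized spans are disjoint, so overlaps are not double-counted.
    covered = sum(span_overlap(r, u) for r in relevant for u in utilized)
    return covered / total_relevant

# Relevant chars 0-100; the generator only drew on chars 0-40 and 80-120.
print(completeness([(0, 100)], [(0, 40), (80, 120)]))  # → 0.6
```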
---

## Code Changes Made

### 1. **EVALUATION_GUIDE.md**
- ✅ Updated the header to reference the RAGBench paper and TRACe (not TRACE)
- ✅ Removed the incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what that means for a RAG system

### 2. **trace_evaluator.py**
- ✅ Updated the module docstring with the paper reference and the correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: which chunking strategy was used
  - `embedding_model`: which embedding model was used
  - `chunk_size`: chunk size configuration
  - `chunk_overlap`: chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include the evaluation config in the results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure plain Python floats)
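A hypothetical sketch of what the interface changes above might look like; the parameter names come from the list above, but the class internals and placeholder scores are illustrative, not the project's actual implementation:

```python
# Hypothetical sketch of the updated TRACEEvaluator interface: optional
# metadata parameters with Optional[...] type hints, numpy results wrapped
# in float(), and the config echoed into the results dict.
from typing import Optional
import numpy as np

class TRACEEvaluator:
    def __init__(
        self,
        chunking_strategy: Optional[str] = None,
        embedding_model: Optional[str] = None,
        chunk_size: Optional[int] = None,
        chunk_overlap: Optional[int] = None,
    ):
        self.evaluation_config = {
            "chunking_strategy": chunking_strategy,
            "embedding_model": embedding_model,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        }

    def evaluate_batch(self, test_cases: list) -> dict:
        # Placeholder scores; real code computes the four TRACe metrics.
        scores = np.array([0.8, 0.9])
        return {
            "mean_utilization": float(scores.mean()),  # plain float, not np.float64
            "evaluation_config": self.evaluation_config,
        }

result = TRACEEvaluator(chunking_strategy="dense", chunk_size=512).evaluate_batch([])
print(result["evaluation_config"]["chunk_size"])  # → 512
```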
### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test run
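The store/restore pattern described above can be sketched as follows. `FakeStore` is a stand-in for a persistent vector store's collection metadata (the real manager would read ChromaDB's collection metadata); method bodies are illustrative:

```python
# Pattern sketch: persist the chunking config in the collection's metadata
# when it is created, and restore it when an existing collection is reopened,
# so every question in a test run sees the same configuration.

class FakeStore:
    """In-memory stand-in for a persistent store's collection metadata."""
    collections: dict = {}

class ChromaDBManager:
    def __init__(self):
        self.chunking_strategy = None
        self.chunk_size = None
        self.chunk_overlap = None

    def load_dataset_into_collection(self, name, strategy, size, overlap):
        self.chunking_strategy, self.chunk_size, self.chunk_overlap = strategy, size, overlap
        # Save the config alongside the collection so it survives the session.
        FakeStore.collections[name] = {
            "chunking_strategy": strategy, "chunk_size": size, "chunk_overlap": overlap,
        }

    def get_collection(self, name):
        # Restore the config that was saved when the collection was created.
        meta = FakeStore.collections[name]
        self.chunking_strategy = meta["chunking_strategy"]
        self.chunk_size = meta["chunk_size"]
        self.chunk_overlap = meta["chunk_overlap"]

m1 = ChromaDBManager()
m1.load_dataset_into_collection("docs", strategy="dense", size=512, overlap=50)

m2 = ChromaDBManager()     # a later session
m2.get_collection("docs")  # config round-trips via collection metadata
print(m2.chunk_size, m2.chunk_overlap)  # → 512 50
```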
### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs the chunking strategy, chunk size, and chunk overlap
  - Logs the embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in the evaluation flow:

```
🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
```
---

## Benefits of These Changes

1. **Alignment with the paper**: Metrics now follow the RAGBench paper's definitions exactly
2. **Reproducibility**: The evaluation config (chunking, embedding) is stored and logged with the results
3. **Consistency**: The same chunking/embedding is used for all test questions within an evaluation
4. **Clarity**: Clear distinction between the 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to see which retrieval config was used

---
## Usage Example

```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50,
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include the evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
---

## Future Improvements

1. Implement **span-level annotation** following the RAGBench approach for ground-truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores

---

## References

- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
  - arXiv: 2407.11005v2
  - Dataset: https://huggingface.co/datasets/rungalileo/ragbench
  - GitHub: https://github.com/rungalileo/ragbench