# TRACe Evaluation Framework - Alignment with RAGBench Paper

## Summary of Changes

This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).

---

## Key Clarifications

### The TRACe Framework is **4 metrics**, NOT 5

❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)

✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)

The stylization "TRACe" is just how the acronym is capitalized; there is no 5th "E" metric.

---

## The 4 TRACe Metrics (Per RAGBench Paper)

### 1. **T — uTilization (Context Utilization)**

**Definition:** The fraction of retrieved context that the generator actually uses to produce the response.

**Formula:**

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)

**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context

---

### 2. **R — Relevance (Context Relevance)**

**Definition:** The fraction of retrieved context that is actually relevant to answering the query.

**Formula:**

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$

**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them

---

### 3. **A — Adherence (Faithfulness / Groundedness / Attribution)**

**Definition:** Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.

**Paper Definition:**
- Example-level: **Boolean** — True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded

**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response

---

### 4. **C — Completeness**

**Definition:** How much of the relevant information in the context is actually covered/incorporated by the response.

**Formula:**

$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents

**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness

---

## Code Changes Made

### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems

### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper type)

### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test

### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, chunk overlap
  - Logs embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in evaluation flow:

```
🔧 Retrieval Configuration:
   • Chunking Strategy:
   • Chunk Size:
   • Chunk Overlap:
   • Embedding Model:
```

---

## Benefits of These Changes

1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used

---

## Usage Example

```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include evaluation config
print(results["evaluation_config"])
# Output: {
#   "chunking_strategy": "dense",
#   "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#   "chunk_size": 512,
#   "chunk_overlap": 50
# }
```

---

## Future Improvements

1. Implement **span-level annotation** following the RAGBench approach for ground-truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores

---

## References

- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems" (arXiv:2407.11005v2)
- Dataset: https://huggingface.co/datasets/rungalileo/ragbench
- GitHub: https://github.com/rungalileo/ragbench
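---

## Appendix: Sketch of the TRACe Formulas in Code

To make the Utilization, Relevance, and Completeness formulas concrete, here is a minimal, self-contained sketch of how they can be computed from token-level span annotations. This is illustrative only: the `trace_metrics` helper and its input shape are assumptions for this example, not part of `trace_evaluator.py`, and a real implementation would first need annotated (or model-predicted) relevant/utilized spans as in the RAGBench paper.

```python
def trace_metrics(docs):
    """Compute example-level TRACe ratios from span-annotated documents.

    Each doc is a dict of token-level annotations:
      'len'      - Len(d_i): total tokens in the document
      'relevant' - R_i: set of relevant token indices
      'utilized' - U_i: set of utilized token indices
    (Adherence is a separate example-level boolean in the paper and
    is not derivable from these sets, so it is omitted here.)
    """
    total_len = sum(d["len"] for d in docs)
    utilized = sum(len(d["utilized"]) for d in docs)
    relevant = sum(len(d["relevant"]) for d in docs)
    # Completeness numerator: relevant tokens that were also utilized
    rel_and_used = sum(len(d["relevant"] & d["utilized"]) for d in docs)
    return {
        "utilization": utilized / total_len if total_len else 0.0,
        "relevance": relevant / total_len if total_len else 0.0,
        "completeness": rel_and_used / relevant if relevant else 0.0,
    }


# Example: one 10-token document; tokens 0-5 are relevant, tokens 2-7
# were actually used by the generator.
docs = [{
    "len": 10,
    "relevant": set(range(0, 6)),
    "utilized": set(range(2, 8)),
}]
metrics = trace_metrics(docs)
print(metrics)  # utilization 0.6, relevance 0.6, completeness 4/6 ≈ 0.667
```

Note how the example exposes the diagnostic reading described above: utilization and relevance are both moderate, and completeness < 1.0 signals that the generator skipped some relevant tokens (indices 0 and 1).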