# TRACe Evaluation Framework - Alignment with RAGBench Paper
## Summary of Changes
This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).
---
## Key Clarifications
### The TRACe Framework Consists of **4 Metrics**, NOT 5
❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)
✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)
The lowercase "e" in "TRACe" is just a stylization of the acronym; there is no 5th "E" metric.
---
## The 4 TRACe Metrics (Per RAGBench Paper)
### 1. **T — uTilization (Context Utilization)**
**Definition:**
The fraction of retrieved context that the generator actually uses to produce the response.
**Formula:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)
**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
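The formula above can be sketched at the character level. This is a minimal illustration, assuming a hypothetical `{"text", "utilized_spans"}` document structure rather than the project's actual data model:

```python
def span_length(spans):
    """Total character length covered by a list of (start, end) spans."""
    return sum(end - start for start, end in spans)

def utilization(documents):
    """Fraction of retrieved context the generator actually used.

    Each document is a dict holding the full text and the character spans
    the generator drew on (hypothetical structure, for illustration only).
    """
    used = sum(span_length(d["utilized_spans"]) for d in documents)
    total = sum(len(d["text"]) for d in documents)
    return used / total if total else 0.0

docs = [
    {"text": "A" * 100, "utilized_spans": [(0, 30)]},  # 30 of 100 chars used
    {"text": "B" * 100, "utilized_spans": []},         # ignored entirely
]
print(utilization(docs))  # 30 / 200 = 0.15
```

The same span representation works at sentence or token granularity; only the `Len` function changes.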
---
### 2. **R — Relevance (Context Relevance)**
**Definition:**
The fraction of retrieved context that is actually relevant to answering the query.
**Formula:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$
**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
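Relevance is the same ratio computed over relevant spans. As a sketch (again assuming a hypothetical `(start, end)` character-span structure, with non-overlapping spans per list), it also shows the high-relevance/low-utilization diagnostic from the last bullet:

```python
def span_fraction(documents, key):
    """Fraction of total context length covered by the spans under `key`."""
    covered = sum(end - start for d in documents for start, end in d[key])
    total = sum(len(d["text"]) for d in documents)
    return covered / total if total else 0.0

docs = [{"text": "x" * 200,
         "relevant_spans": [(0, 150)],   # 75% of the context is relevant
         "utilized_spans": [(0, 20)]}]   # but only 10% was actually used

relevance = span_fraction(docs, "relevant_spans")
util = span_fraction(docs, "utilized_spans")
print(relevance, util)  # 0.75 0.1
if relevance > 0.5 and util < 0.2:
    print("good retrieval, but the generator under-uses the context")
```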
---
### 3. **A — Adherence (Faithfulness / Groundedness / Attribution)**
**Definition:**
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.
**Paper Definition:**
- Example-level: **Boolean** — True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded
**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
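The Boolean example-level definition reduces to an `all()` over per-sentence support labels. In practice those labels would come from a human annotator or an NLI-style judge model; that pipeline is assumed here, not shown:

```python
def adherence(sentence_supported):
    """Example-level adherence: True iff every response sentence is
    grounded in (supported by) the retrieved context."""
    return all(sentence_supported)

# Support flags per response sentence (produced by a judge, assumed here).
print(adherence([True, True, True]))   # fully grounded -> True
print(adherence([True, False, True]))  # one hallucinated sentence -> False
```

Averaging the per-sentence flags instead of taking `all()` gives the mid-range "partially grounded" reading mentioned above.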
---
### 4. **C — Completeness**
**Definition:**
How much of the relevant information in the context is actually covered/incorporated by the response.
**Formula:**
$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$
Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans in document $d_i$
- $R_i$ = relevant spans in document $d_i$
- The sums aggregate across documents, yielding the example-level score
**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
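A minimal sketch of the span-intersection computation, again assuming hypothetical `(start, end)` character spans with no overlaps within each list:

```python
def intersection_length(a_spans, b_spans):
    """Total character overlap between two span lists (each assumed
    internally non-overlapping, so pairwise overlaps don't double-count)."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a_spans
        for b_start, b_end in b_spans
    )

def completeness(documents):
    """Len(R ∩ U) / Len(R), aggregated across all documents."""
    covered = sum(
        intersection_length(d["relevant_spans"], d["utilized_spans"])
        for d in documents
    )
    relevant = sum(
        end - start for d in documents for start, end in d["relevant_spans"]
    )
    return covered / relevant if relevant else 0.0

docs = [{"relevant_spans": [(0, 100)], "utilized_spans": [(50, 120)]}]
print(completeness(docs))  # overlap 50 / relevant 100 = 0.5
```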
---
## Code Changes Made
### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems
### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper type)
### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test run
### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, chunk overlap
  - Logs embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in evaluation flow:
```
🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
```
---
## Benefits of These Changes
1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used
---
## Usage Example
```python
from trace_evaluator import TRACEEvaluator
# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)
# Run evaluation
results = evaluator.evaluate_batch(test_cases)
# Results now include evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
---
## Future Improvements
1. Implement **span-level annotation** following RAGBench approach for ground truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores
---
## References
- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
  - arXiv: 2407.11005v2
  - Dataset: https://huggingface.co/datasets/rungalileo/ragbench
  - GitHub: https://github.com/rungalileo/ragbench