TRACe Evaluation Framework - Alignment with RAGBench Paper
Summary of Changes
This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the TRACe framework as defined in the RAGBench paper (arXiv:2407.11005).
Key Clarifications
The TRACe Framework Has 4 Metrics, NOT 5
❌ Incorrect: T, R, A, C, E (with "E = Evaluation" as a separate metric)
✅ Correct: T, R, A, C (as defined in the RAGBench paper)
The stylization "TRACe" is just how the acronym is capitalized; there is no 5th "E" metric.
The 4 TRACe Metrics (Per RAGBench Paper)
1. T – uTilization (Context Utilization)
Definition:
The fraction of retrieved context that the generator actually uses to produce the response.
Formula:

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)
Interpretation:
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
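As a minimal, hypothetical sketch (not the project's actual implementation), token-level Utilization can be computed from the definitions above, assuming utilized spans $U_i$ are available as sets of token indices per retrieved document:

```python
# Hypothetical helper, assuming span annotations as token-index sets.
def utilization(u_spans: list[set[int]], doc_lens: list[int]) -> float:
    """Fraction of retrieved context the generator actually used."""
    total = sum(doc_lens)
    used = sum(len(u) for u in u_spans)
    return used / total if total else 0.0

# Two 10-token docs; the response draws on 3 tokens of the first
# and none of the second, so utilization is 3 / 20 = 0.15.
print(utilization([{0, 1, 2}, set()], [10, 10]))  # 0.15
```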
2. R – Relevance (Context Relevance)
Definition:
The fraction of retrieved context that is actually relevant to answering the query.
Formula:

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$
Interpretation:
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
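Relevance has the same length-ratio form, just over relevant spans $R_i$ instead of utilized spans. A hedged sketch (`span_fraction` is a hypothetical helper, not code from trace_evaluator.py) shows how comparing the two ratios diagnoses the "good docs, unused" failure mode:

```python
# Hypothetical helper: total annotated-span length over total context length.
def span_fraction(spans: list[set[int]], doc_lens: list[int]) -> float:
    total = sum(doc_lens)
    return sum(len(s) for s in spans) / total if total else 0.0

# One 10-token doc: 8 tokens are relevant, but the response used only 2.
rel = span_fraction([{0, 1, 2, 3, 4, 5, 6, 7}], [10])  # relevance = 0.8
use = span_fraction([{0, 1}], [10])                    # utilization = 0.2
# High relevance + low utilization -> the generator underuses good context.
print(rel, use)
```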
3. A – Adherence (Faithfulness / Groundedness / Attribution)
Definition:
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.
Paper Definition:
- Example-level: Boolean – True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded
Interpretation:
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
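The example-level boolean reduces to an `all()` over per-sentence support labels. A minimal sketch, assuming those labels come from some upstream judge or annotation (the `adherence` helper is hypothetical, not project code):

```python
# Hypothetical helper over per-sentence support labels.
def adherence(sentence_supported: list[bool]) -> bool:
    """True only if every response sentence is grounded in the context."""
    return all(sentence_supported)

print(adherence([True, True, True]))   # True  -> fully grounded
print(adherence([True, False, True]))  # False -> contains an unsupported claim
```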
4. C – Completeness
Definition:
How much of the relevant information in the context is actually covered/incorporated by the response.
Formula:

$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents
Interpretation:
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
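Completeness is the utilized share of the relevant spans, per the $R_i \cap U_i$ intersection above. A hedged token-level sketch (the `completeness` helper is hypothetical, not code from trace_evaluator.py):

```python
# Hypothetical helper: fraction of relevant spans the response covered.
def completeness(r_spans: list[set[int]], u_spans: list[set[int]]) -> float:
    relevant = sum(len(r) for r in r_spans)
    covered = sum(len(r & u) for r, u in zip(r_spans, u_spans))
    return covered / relevant if relevant else 0.0

# Relevant tokens {0..4}; the response used {2..6}: 3 of 5 covered.
print(completeness([{0, 1, 2, 3, 4}], [{2, 3, 4, 5, 6}]))  # 0.6
```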
Code Changes Made
1. EVALUATION_GUIDE.md
- ✅ Updated header to reference the RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems
2. trace_evaluator.py
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: which chunking strategy was used
  - `embedding_model`: which embedding model was used
  - `chunk_size`: chunk size configuration
  - `chunk_overlap`: chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include the evaluation config in the results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper typing)
3. vector_store.py (ChromaDBManager)
- ✅ Added instance attributes to track evaluation-related metadata: `self.chunking_strategy`, `self.chunk_size`, `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test
4. streamlit_app.py
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, and chunk overlap
  - Logs the embedding model used
  - Passes this metadata to `TRACEEvaluator` for tracking
- ✅ Added new log entries in the evaluation flow:

  🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
Benefits of These Changes
- Alignment with Paper: Metrics now follow RAGBench paper definitions exactly
- Reproducibility: Evaluation config (chunking, embedding) is stored and logged with results
- Consistency: Same chunking/embedding used for all test questions per evaluation
- Clarity: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
- Traceability: Results can be audited to understand what retrieval config was used
Usage Example
```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50,
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include the evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
Future Improvements
- Implement span-level annotation following RAGBench approach for ground truth metrics
- Add fine-tuned evaluator models (e.g., DeBERTa) for more accurate metric computation
- Store evaluation results with full metadata in persistent storage for historical tracking
- Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores
References
- RAGBench Paper: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
- arXiv: 2407.11005v2
- Dataset: https://huggingface.co/datasets/rungalileo/ragbench
- GitHub: https://github.com/rungalileo/ragbench