
TRACE Evaluation - Detailed Results Export

Overview

Enhanced TRACE evaluation to save comprehensive per-query data including questions, retrieved documents, LLM responses, and metrics for each test case.

Changes Made

1. trace_evaluator.py - Updated evaluate_batch() method

Added detailed_results array to the evaluation output that captures:

{
    "query_id": 1,
    "question": "What is machine learning?",
    "llm_response": "Machine learning is a subset of artificial intelligence...",
    "retrieved_documents": [
        "Document 1 text...",
        "Document 2 text...",
        ...
    ],
    "ground_truth": "Ground truth answer...",
    "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
    }
}
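As a hedged sketch, the per-query record above could be assembled roughly as follows inside `evaluate_batch()`. The field names match the documented structure, but the helper function and its parameters (`build_detailed_result`, `scores`, etc.) are illustrative, not the project's actual code.

```python
def build_detailed_result(query_id, question, llm_response,
                          retrieved_documents, ground_truth, scores):
    """Package one test case's inputs, outputs, and metrics.

    Illustrative helper: names are hypothetical; only the output
    structure mirrors the documented detailed_results entry.
    """
    metrics = {
        "utilization": scores["utilization"],
        "relevance": scores["relevance"],
        "adherence": scores["adherence"],
        "completeness": scores["completeness"],
    }
    # "average" is assumed to be the mean of the four TRACE metrics,
    # consistent with the example values shown above.
    metrics["average"] = sum(metrics.values()) / len(metrics)
    return {
        "query_id": query_id,
        "question": question,
        "llm_response": llm_response,
        "retrieved_documents": retrieved_documents,
        "ground_truth": ground_truth,
        "metrics": metrics,
    }
```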

2. streamlit_app.py - Enhanced Evaluation Display

A. Summary Metrics View

  • Displays aggregate TRACE scores across all test cases
  • Shows metrics in table format

B. Detailed Per-Query Analysis

  • Expandable section for each query showing:
    • Question: Original user query
    • LLM Response: Generated answer
    • Retrieved Documents: Each document in its own expandable section
    • Ground Truth: Expected answer (if available)
    • TRACE Metrics: Utilization, Relevance, Adherence, Completeness

C. Enhanced JSON Download

  • Single button downloads complete evaluation with:
    • Aggregate scores
    • Individual query metrics
    • Full per-query details (Q, response, docs, ground truth)
    • Evaluation configuration (chunking, embedding model, chunk size, etc.)

JSON Output Structure

{
  "utilization": 0.85,
  "relevance": 0.92,
  "adherence": 0.88,
  "completeness": 0.79,
  "average": 0.86,
  "num_samples": 10,
  "individual_scores": [
    {
      "utilization": 0.85,
      "relevance": 0.92,
      "adherence": 0.88,
      "completeness": 0.79,
      "average": 0.86
    },
    ...
  ],
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is machine learning?",
      "llm_response": "Machine learning is...",
      "retrieved_documents": ["Doc 1", "Doc 2", ...],
      "ground_truth": "ML is a field of AI...",
      "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
      }
    },
    ...
  ],
  "evaluation_config": {
    "chunking_strategy": "dense",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
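The example output suggests that each top-level aggregate is the mean of that metric over `individual_scores`. Under that assumption, the aggregates can be recomputed from the per-query scores with a small sketch (the function name is illustrative):

```python
def aggregate_scores(individual_scores):
    """Recompute top-level aggregates as per-metric means over all
    evaluated queries (assumed behavior, inferred from the example)."""
    keys = ["utilization", "relevance", "adherence", "completeness", "average"]
    n = len(individual_scores)
    return {k: sum(s[k] for s in individual_scores) / n for k in keys}
```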

How to Use

In Streamlit UI:

  1. Run Evaluation

  • Go to the "📊 Evaluation" tab
    • Select LLM model
    • Set number of test samples
    • Click "🔬 Run Evaluation"
  2. View Results

    • Aggregate metrics displayed at top
    • "πŸ“‹ Summary Metrics by Query" - Table view of all scores
    • "πŸ” Detailed Per-Query Analysis" - Expandable details for each query
  3. Download Results

    • Click "💾 Download Complete Results (JSON)"
    • Saves file: trace_evaluation_YYYYMMDD_HHMMSS.json
    • Contains all information for analysis and reproducibility
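A minimal sketch of how the app could build the download payload and timestamped filename described above; in Streamlit this pair would typically feed `st.download_button`. The helper name `make_download` is hypothetical:

```python
import json
from datetime import datetime

def make_download(results):
    """Serialize the full evaluation dict and build a timestamped
    filename in the documented trace_evaluation_YYYYMMDD_HHMMSS.json
    format. (Illustrative helper, not the project's actual code.)"""
    payload = json.dumps(results, indent=2)
    filename = f"trace_evaluation_{datetime.now():%Y%m%d_%H%M%S}.json"
    return payload, filename
```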

Analyzing Downloaded JSON:

import json

with open("trace_evaluation_20250119_153022.json") as f:
    results = json.load(f)

# Access aggregate scores
print(f"Average Score: {results['average']:.2%}")

# Access per-query details
for query in results['detailed_results']:
    print(f"\nQuery {query['query_id']}: {query['question']}")
    print(f"  Metrics: {query['metrics']}")
    print(f"  Response: {query['llm_response'][:100]}...")
    print(f"  Docs Retrieved: {len(query['retrieved_documents'])}")

Benefits

✅ Reproducibility - Full evaluation context saved with results
✅ Transparency - See exactly which questions were asked and which documents were retrieved
✅ Analysis - Easy to analyze correlations between queries, responses, and metrics
✅ Audit Trail - Complete record for reporting and review
✅ Debugging - Identify problematic cases and understand why metrics were low
✅ Comparison - Compare results across different configurations

Files Modified

  1. trace_evaluator.py (Lines 359-436)

    • Enhanced evaluate_batch() method
    • Collects per-query details
    • Adds detailed_results field to output
  2. streamlit_app.py (Lines 640-682)

    • Added detailed per-query analysis section
    • Enhanced download with full context
    • Improved UI with expandable query details

Example Workflow

1. Create collection with specific embedding model and chunking strategy
2. Run TRACE evaluation on test samples
3. Review metrics in Streamlit UI
4. Click details to inspect specific queries
5. Download JSON for:
   - Archival and reproducibility
   - Further analysis in notebooks
   - Sharing with stakeholders
   - Comparing different configurations
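For the notebook-analysis and configuration-comparison steps above, one convenient approach is to flatten `detailed_results` into a CSV with one row per query. This is an illustrative helper built only on the documented JSON structure, not part of the project code:

```python
import csv
import json

def export_metrics_csv(results_path, csv_path):
    """Flatten per-query TRACE metrics from a downloaded evaluation
    JSON into a CSV (one row per query), convenient for notebook
    analysis or side-by-side configuration comparison."""
    with open(results_path) as f:
        results = json.load(f)
    fields = ["query_id", "utilization", "relevance", "adherence",
              "completeness", "average"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for q in results["detailed_results"]:
            writer.writerow({"query_id": q["query_id"], **q["metrics"]})
```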

Integration with RAG Pipeline

The evaluation captures:

  • Queries: From RAGBench test split
  • Retrieved Documents: From vector store retrieval
  • LLM Response: From Groq API
  • Ground Truth: From RAGBench labels
  • Metrics: Computed by TRACEEvaluator
  • Configuration: Embedding model, chunking strategy, parameters

All information is preserved in the JSON export for complete traceability.