
TRACE Evaluation - Detailed Results Export

Overview

Enhanced TRACE evaluation to save comprehensive per-query data including questions, retrieved documents, LLM responses, and metrics for each test case.

Changes Made

1. trace_evaluator.py - Updated evaluate_batch() method

Added detailed_results array to the evaluation output that captures:

{
    "query_id": 1,
    "question": "What is machine learning?",
    "llm_response": "Machine learning is a subset of artificial intelligence...",
    "retrieved_documents": [
        "Document 1 text...",
        "Document 2 text...",
        ...
    ],
    "ground_truth": "Ground truth answer...",
    "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
    }
}
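As a hedged sketch, the per-query record above could be assembled roughly as follows inside `evaluate_batch()`. The field names match the documented structure, but the helper function and its parameters (`build_detailed_result`, `scores`, etc.) are illustrative, not the project's actual code.

```python
def build_detailed_result(query_id, question, llm_response,
                          retrieved_documents, ground_truth, scores):
    """Package one test case's inputs, outputs, and metrics.

    Illustrative helper: names are hypothetical; only the output
    structure mirrors the documented detailed_results entry.
    """
    metrics = {
        "utilization": scores["utilization"],
        "relevance": scores["relevance"],
        "adherence": scores["adherence"],
        "completeness": scores["completeness"],
    }
    # "average" is assumed to be the mean of the four TRACE metrics,
    # consistent with the example values shown above.
    metrics["average"] = sum(metrics.values()) / len(metrics)
    return {
        "query_id": query_id,
        "question": question,
        "llm_response": llm_response,
        "retrieved_documents": retrieved_documents,
        "ground_truth": ground_truth,
        "metrics": metrics,
    }
```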

2. streamlit_app.py - Enhanced Evaluation Display

A. Summary Metrics View

  • Displays aggregate TRACE scores across all test cases
  • Shows metrics in table format

B. Detailed Per-Query Analysis

  • Expandable section for each query showing:
    • Question: Original user query
    • LLM Response: Generated answer
    • Retrieved Documents: Each document in its own expandable section
    • Ground Truth: Expected answer (if available)
    • TRACE Metrics: Utilization, Relevance, Adherence, Completeness

C. Enhanced JSON Download

  • Single button downloads complete evaluation with:
    • Aggregate scores
    • Individual query metrics
    • Full per-query details (Q, response, docs, ground truth)
    • Evaluation configuration (chunking, embedding model, chunk size, etc.)

JSON Output Structure

{
  "utilization": 0.85,
  "relevance": 0.92,
  "adherence": 0.88,
  "completeness": 0.79,
  "average": 0.86,
  "num_samples": 10,
  "individual_scores": [
    {
      "utilization": 0.85,
      "relevance": 0.92,
      "adherence": 0.88,
      "completeness": 0.79,
      "average": 0.86
    },
    ...
  ],
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is machine learning?",
      "llm_response": "Machine learning is...",
      "retrieved_documents": ["Doc 1", "Doc 2", ...],
      "ground_truth": "ML is a field of AI...",
      "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
      }
    },
    ...
  ],
  "evaluation_config": {
    "chunking_strategy": "dense",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
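The example output suggests that each top-level aggregate is the mean of that metric over `individual_scores`. Under that assumption, the aggregates can be recomputed from the per-query scores with a small sketch (the function name is illustrative):

```python
def aggregate_scores(individual_scores):
    """Recompute top-level aggregates as per-metric means over all
    evaluated queries (assumed behavior, inferred from the example)."""
    keys = ["utilization", "relevance", "adherence", "completeness", "average"]
    n = len(individual_scores)
    return {k: sum(s[k] for s in individual_scores) / n for k in keys}
```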

How to Use

In Streamlit UI:

  1. Run Evaluation

  • Go to the "📊 Evaluation" tab
    • Select LLM model
    • Set number of test samples
    • Click "🔬 Run Evaluation"
  2. View Results

    • Aggregate metrics displayed at top
    • "πŸ“‹ Summary Metrics by Query" - Table view of all scores
    • "πŸ” Detailed Per-Query Analysis" - Expandable details for each query
  3. Download Results

    • Click "💾 Download Complete Results (JSON)"
    • Saves file: trace_evaluation_YYYYMMDD_HHMMSS.json
    • Contains all information for analysis and reproducibility
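A minimal sketch of how the app could build the download payload and timestamped filename described above; in Streamlit this pair would typically feed `st.download_button`. The helper name `make_download` is hypothetical:

```python
import json
from datetime import datetime

def make_download(results):
    """Serialize the full evaluation dict and build a timestamped
    filename in the documented trace_evaluation_YYYYMMDD_HHMMSS.json
    format. (Illustrative helper, not the project's actual code.)"""
    payload = json.dumps(results, indent=2)
    filename = f"trace_evaluation_{datetime.now():%Y%m%d_%H%M%S}.json"
    return payload, filename
```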

Analyzing Downloaded JSON:

import json

with open("trace_evaluation_20250119_153022.json") as f:
    results = json.load(f)

# Access aggregate scores
print(f"Average Score: {results['average']:.2%}")

# Access per-query details
for query in results['detailed_results']:
    print(f"\nQuery {query['query_id']}: {query['question']}")
    print(f"  Metrics: {query['metrics']}")
    print(f"  Response: {query['llm_response'][:100]}...")
    print(f"  Docs Retrieved: {len(query['retrieved_documents'])}")

Benefits

✅ Reproducibility - Full evaluation context saved with results
✅ Transparency - See exactly which questions were asked and which documents were retrieved
✅ Analysis - Easy to analyze correlations between queries, responses, and metrics
✅ Audit Trail - Complete record for reporting and review
✅ Debugging - Identify problematic cases and understand why metrics were low
✅ Comparison - Compare results across different configurations

Files Modified

  1. trace_evaluator.py (Lines 359-436)

    • Enhanced evaluate_batch() method
    • Collects per-query details
    • Adds detailed_results field to output
  2. streamlit_app.py (Lines 640-682)

    • Added detailed per-query analysis section
    • Enhanced download with full context
    • Improved UI with expandable query details

Example Workflow

1. Create collection with specific embedding model and chunking strategy
2. Run TRACE evaluation on test samples
3. Review metrics in Streamlit UI
4. Click details to inspect specific queries
5. Download JSON for:
   - Archival and reproducibility
   - Further analysis in notebooks
   - Sharing with stakeholders
   - Comparing different configurations
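For the notebook-analysis and configuration-comparison steps above, one convenient approach is to flatten `detailed_results` into a CSV with one row per query. This is an illustrative helper built only on the documented JSON structure, not part of the project code:

```python
import csv
import json

def export_metrics_csv(results_path, csv_path):
    """Flatten per-query TRACE metrics from a downloaded evaluation
    JSON into a CSV (one row per query), convenient for notebook
    analysis or side-by-side configuration comparison."""
    with open(results_path) as f:
        results = json.load(f)
    fields = ["query_id", "utilization", "relevance", "adherence",
              "completeness", "average"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for q in results["detailed_results"]:
            writer.writerow({"query_id": q["query_id"], **q["metrics"]})
```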

Integration with RAG Pipeline

The evaluation captures:

  • Queries: From RAGBench test split
  • Retrieved Documents: From vector store retrieval
  • LLM Response: From Groq API
  • Ground Truth: From RAGBench labels
  • Metrics: Computed by TRACEEvaluator
  • Configuration: Embedding model, chunking strategy, parameters

All information is preserved in the JSON export for complete traceability.