# TRACE Evaluation - Detailed Results Export
## Overview
Enhanced TRACE evaluation to save comprehensive per-query data including questions, retrieved documents, LLM responses, and metrics for each test case.
## Changes Made
### 1. **trace_evaluator.py** - Updated `evaluate_batch()` method
Added `detailed_results` array to the evaluation output that captures:
```python
{
    "query_id": 1,
    "question": "What is machine learning?",
    "llm_response": "Machine learning is a subset of artificial intelligence...",
    "retrieved_documents": [
        "Document 1 text...",
        "Document 2 text...",
        ...
    ],
    "ground_truth": "Ground truth answer...",
    "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
    }
}
```
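The collection logic inside `evaluate_batch()` could be sketched roughly as follows. This is an illustrative stand-in, not the actual implementation: the class name, the `_score_sample` helper, and the hard-coded scores are assumptions (the real evaluator computes TRACE metrics per sample).

```python
class TraceEvaluatorSketch:
    """Illustrative stand-in for the real TRACEEvaluator (names assumed)."""

    def _score_sample(self, sample):
        # Placeholder: the real evaluator scores each TRACE dimension per sample.
        scores = {"utilization": 0.85, "relevance": 0.92,
                  "adherence": 0.88, "completeness": 0.79}
        scores["average"] = sum(scores.values()) / 4
        return scores

    def evaluate_batch(self, samples):
        """Score each sample and keep the full per-query context.

        `samples` is assumed to be a list of dicts with keys "question",
        "llm_response", "retrieved_documents", and (optionally) "ground_truth".
        """
        individual_scores = []
        detailed_results = []
        for i, sample in enumerate(samples, start=1):
            metrics = self._score_sample(sample)
            individual_scores.append(metrics)
            detailed_results.append({
                "query_id": i,
                "question": sample["question"],
                "llm_response": sample["llm_response"],
                "retrieved_documents": sample["retrieved_documents"],
                "ground_truth": sample.get("ground_truth"),
                "metrics": metrics,
            })
        # Aggregate each metric across all samples.
        keys = ("utilization", "relevance", "adherence", "completeness", "average")
        aggregate = {k: sum(s[k] for s in individual_scores) / len(individual_scores)
                     for k in keys}
        return {**aggregate,
                "num_samples": len(samples),
                "individual_scores": individual_scores,
                "detailed_results": detailed_results}
```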
### 2. **streamlit_app.py** - Enhanced Evaluation Display
#### A. Summary Metrics View
- Displays aggregate TRACE scores across all test cases
- Shows metrics in table format
#### B. Detailed Per-Query Analysis
- Expandable section for each query showing:
  - **Question**: Original user query
  - **LLM Response**: Generated answer
  - **Retrieved Documents**: Each document in its own expandable section
  - **Ground Truth**: Expected answer (if available)
  - **TRACE Metrics**: Utilization, Relevance, Adherence, Completeness
#### C. Enhanced JSON Download
- A single button downloads the complete evaluation with:
  - Aggregate scores
  - Individual query metrics
  - Full per-query details (question, response, documents, ground truth)
  - Evaluation configuration (chunking strategy, embedding model, chunk size, etc.)
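Assembling the download payload amounts to merging the evaluator output with the run configuration. A minimal sketch, assuming a helper named `build_download_payload` (the function name is hypothetical; the field names match the JSON structure below):

```python
import json

def build_download_payload(results, config):
    """Combine evaluate_batch() output with the run configuration
    into one JSON string for the download button."""
    payload = dict(results)                 # aggregate scores + per-query data
    payload["evaluation_config"] = config   # chunking, embedding model, sizes
    return json.dumps(payload, indent=2)
```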
## JSON Output Structure
```json
{
  "utilization": 0.85,
  "relevance": 0.92,
  "adherence": 0.88,
  "completeness": 0.79,
  "average": 0.86,
  "num_samples": 10,
  "individual_scores": [
    {
      "utilization": 0.85,
      "relevance": 0.92,
      "adherence": 0.88,
      "completeness": 0.79,
      "average": 0.86
    },
    ...
  ],
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is machine learning?",
      "llm_response": "Machine learning is...",
      "retrieved_documents": ["Doc 1", "Doc 2", ...],
      "ground_truth": "ML is a field of AI...",
      "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
      }
    },
    ...
  ],
  "evaluation_config": {
    "chunking_strategy": "dense",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
```
## How to Use
### In Streamlit UI:
1. **Run Evaluation**
   - Go to the "Evaluation" tab
   - Select an LLM model
   - Set the number of test samples
   - Click "Run Evaluation"
2. **View Results**
   - Aggregate metrics displayed at the top
   - "Summary Metrics by Query": table view of all scores
   - "Detailed Per-Query Analysis": expandable details for each query
3. **Download Results**
   - Click "Download Complete Results (JSON)"
   - Saves the file as `trace_evaluation_YYYYMMDD_HHMMSS.json`
   - Contains all information for analysis and reproducibility
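The timestamped filename can be produced with a standard `strftime` pattern. A sketch, assuming a hypothetical `evaluation_filename` helper:

```python
from datetime import datetime

def evaluation_filename(now=None):
    """Timestamped export name, e.g. trace_evaluation_20250119_153022.json."""
    now = now or datetime.now()
    return now.strftime("trace_evaluation_%Y%m%d_%H%M%S.json")
```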
### Analyzing Downloaded JSON:
```python
import json

with open("trace_evaluation_20250119_153022.json") as f:
    results = json.load(f)

# Access aggregate scores
print(f"Average Score: {results['average']:.2%}")

# Access per-query details
for query in results['detailed_results']:
    print(f"\nQuery {query['query_id']}: {query['question']}")
    print(f"  Metrics: {query['metrics']}")
    print(f"  Response: {query['llm_response'][:100]}...")
    print(f"  Docs Retrieved: {len(query['retrieved_documents'])}")
```
## Benefits
- **Reproducibility** - Full evaluation context saved with results
- **Transparency** - See exactly what questions were asked and which documents were retrieved
- **Analysis** - Easy to analyze correlations between queries, responses, and metrics
- **Audit Trail** - Complete record for reporting and review
- **Debugging** - Identify problematic cases and understand why metrics were low
- **Comparison** - Compare results across different configurations
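For the debugging use case, a quick pass over `detailed_results` can surface the weakest queries. A sketch against the JSON structure above (the `worst_queries` helper and its threshold are hypothetical):

```python
def worst_queries(results, n=3, threshold=0.7):
    """Return (query_id, average, below_threshold) for the n weakest queries."""
    ranked = sorted(results["detailed_results"],
                    key=lambda q: q["metrics"]["average"])
    return [(q["query_id"], q["metrics"]["average"],
             q["metrics"]["average"] < threshold)
            for q in ranked[:n]]
```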
## Files Modified
1. **trace_evaluator.py** (lines 359-436)
   - Enhanced `evaluate_batch()` method
   - Collects per-query details
   - Adds `detailed_results` field to the output
2. **streamlit_app.py** (lines 640-682)
   - Added detailed per-query analysis section
   - Enhanced download with full context
   - Improved UI with expandable query details
## Example Workflow
```
1. Create collection with specific embedding model and chunking strategy
2. Run TRACE evaluation on test samples
3. Review metrics in Streamlit UI
4. Click details to inspect specific queries
5. Download JSON for:
- Archival and reproducibility
- Further analysis in notebooks
- Sharing with stakeholders
- Comparing different configurations
```
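Step 5's comparison across configurations can be done directly from two downloaded exports, since each one carries its `evaluation_config`. A minimal sketch (the `compare_runs` helper and the file names are illustrative):

```python
import json

def compare_runs(path_a, path_b,
                 keys=("utilization", "relevance", "adherence",
                       "completeness", "average")):
    """Return the per-metric delta (run B minus run A) between two exports."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {k: round(b[k] - a[k], 4) for k in keys}
```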
## Integration with RAG Pipeline
The evaluation captures:
- **Queries**: From RAGBench test split
- **Retrieved Documents**: From vector store retrieval
- **LLM Response**: From Groq API
- **Ground Truth**: From RAGBench labels
- **Metrics**: Computed by TRACEEvaluator
- **Configuration**: Embedding model, chunking strategy, parameters
All information is preserved in the JSON export for complete traceability.