# TRACE Evaluation - Detailed Results Export
## Overview
Enhanced TRACE evaluation to save comprehensive per-query data including questions, retrieved documents, LLM responses, and metrics for each test case.
## Changes Made
### 1. **trace_evaluator.py** - Updated `evaluate_batch()` method
Added `detailed_results` array to the evaluation output that captures:
```python
{
    "query_id": 1,
    "question": "What is machine learning?",
    "llm_response": "Machine learning is a subset of artificial intelligence...",
    "retrieved_documents": [
        "Document 1 text...",
        "Document 2 text...",
        ...
    ],
    "ground_truth": "Ground truth answer...",
    "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
    }
}
```
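The collection logic inside `evaluate_batch()` could be sketched roughly as follows. This is an illustrative stand-in, not the actual implementation: the class name, the `_score_sample` helper, and the hard-coded scores are assumptions (the real evaluator computes TRACE metrics per sample).

```python
class TraceEvaluatorSketch:
    """Illustrative stand-in for the real TRACEEvaluator (names assumed)."""

    def _score_sample(self, sample):
        # Placeholder: the real evaluator scores each TRACE dimension per sample.
        scores = {"utilization": 0.85, "relevance": 0.92,
                  "adherence": 0.88, "completeness": 0.79}
        scores["average"] = sum(scores.values()) / 4
        return scores

    def evaluate_batch(self, samples):
        """Score each sample and keep the full per-query context.

        `samples` is assumed to be a list of dicts with keys "question",
        "llm_response", "retrieved_documents", and (optionally) "ground_truth".
        """
        individual_scores = []
        detailed_results = []
        for i, sample in enumerate(samples, start=1):
            metrics = self._score_sample(sample)
            individual_scores.append(metrics)
            detailed_results.append({
                "query_id": i,
                "question": sample["question"],
                "llm_response": sample["llm_response"],
                "retrieved_documents": sample["retrieved_documents"],
                "ground_truth": sample.get("ground_truth"),
                "metrics": metrics,
            })
        # Aggregate each metric across all samples.
        keys = ("utilization", "relevance", "adherence", "completeness", "average")
        aggregate = {k: sum(s[k] for s in individual_scores) / len(individual_scores)
                     for k in keys}
        return {**aggregate,
                "num_samples": len(samples),
                "individual_scores": individual_scores,
                "detailed_results": detailed_results}
```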
### 2. **streamlit_app.py** - Enhanced Evaluation Display
#### A. Summary Metrics View
- Displays aggregate TRACE scores across all test cases
- Shows metrics in table format
#### B. Detailed Per-Query Analysis
- Expandable section for each query showing:
  - **Question**: Original user query
  - **LLM Response**: Generated answer
  - **Retrieved Documents**: Each document in its own expandable section
  - **Ground Truth**: Expected answer (if available)
  - **TRACE Metrics**: Utilization, Relevance, Adherence, Completeness
#### C. Enhanced JSON Download
- A single button downloads the complete evaluation with:
  - Aggregate scores
  - Individual query metrics
  - Full per-query details (question, response, documents, ground truth)
  - Evaluation configuration (chunking strategy, embedding model, chunk size, etc.)
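Assembling the download payload amounts to merging the evaluator output with the run configuration. A minimal sketch, assuming a helper named `build_download_payload` (the function name is hypothetical; the field names match the JSON structure below):

```python
import json

def build_download_payload(results, config):
    """Combine evaluate_batch() output with the run configuration
    into one JSON string for the download button."""
    payload = dict(results)                 # aggregate scores + per-query data
    payload["evaluation_config"] = config   # chunking, embedding model, sizes
    return json.dumps(payload, indent=2)
```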
## JSON Output Structure
```json
{
  "utilization": 0.85,
  "relevance": 0.92,
  "adherence": 0.88,
  "completeness": 0.79,
  "average": 0.86,
  "num_samples": 10,
  "individual_scores": [
    {
      "utilization": 0.85,
      "relevance": 0.92,
      "adherence": 0.88,
      "completeness": 0.79,
      "average": 0.86
    },
    ...
  ],
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is machine learning?",
      "llm_response": "Machine learning is...",
      "retrieved_documents": ["Doc 1", "Doc 2", ...],
      "ground_truth": "ML is a field of AI...",
      "metrics": {
        "utilization": 0.85,
        "relevance": 0.92,
        "adherence": 0.88,
        "completeness": 0.79,
        "average": 0.86
      }
    },
    ...
  ],
  "evaluation_config": {
    "chunking_strategy": "dense",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
```
## How to Use
### In Streamlit UI:
1. **Run Evaluation**
   - Go to the "Evaluation" tab
   - Select an LLM model
   - Set the number of test samples
   - Click "Run Evaluation"
2. **View Results**
   - Aggregate metrics displayed at the top
   - "Summary Metrics by Query": table view of all scores
   - "Detailed Per-Query Analysis": expandable details for each query
3. **Download Results**
   - Click "Download Complete Results (JSON)"
   - Saves the file as `trace_evaluation_YYYYMMDD_HHMMSS.json`
   - Contains all information for analysis and reproducibility
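The timestamped filename can be produced with a standard `strftime` pattern. A sketch, assuming a hypothetical `evaluation_filename` helper:

```python
from datetime import datetime

def evaluation_filename(now=None):
    """Timestamped export name, e.g. trace_evaluation_20250119_153022.json."""
    now = now or datetime.now()
    return now.strftime("trace_evaluation_%Y%m%d_%H%M%S.json")
```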
### Analyzing Downloaded JSON:
```python
import json

with open("trace_evaluation_20250119_153022.json") as f:
    results = json.load(f)

# Access aggregate scores
print(f"Average Score: {results['average']:.2%}")

# Access per-query details
for query in results['detailed_results']:
    print(f"\nQuery {query['query_id']}: {query['question']}")
    print(f"  Metrics: {query['metrics']}")
    print(f"  Response: {query['llm_response'][:100]}...")
    print(f"  Docs Retrieved: {len(query['retrieved_documents'])}")
```
## Benefits
- **Reproducibility** - Full evaluation context saved with results
- **Transparency** - See exactly what questions were asked and which documents were retrieved
- **Analysis** - Easy to analyze correlations between queries, responses, and metrics
- **Audit Trail** - Complete record for reporting and review
- **Debugging** - Identify problematic cases and understand why metrics were low
- **Comparison** - Compare results across different configurations
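For the debugging use case, a quick pass over `detailed_results` can surface the weakest queries. A sketch against the JSON structure above (the `worst_queries` helper and its threshold are hypothetical):

```python
def worst_queries(results, n=3, threshold=0.7):
    """Return (query_id, average, below_threshold) for the n weakest queries."""
    ranked = sorted(results["detailed_results"],
                    key=lambda q: q["metrics"]["average"])
    return [(q["query_id"], q["metrics"]["average"],
             q["metrics"]["average"] < threshold)
            for q in ranked[:n]]
```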
## Files Modified
1. **trace_evaluator.py** (lines 359-436)
   - Enhanced `evaluate_batch()` method
   - Collects per-query details
   - Adds `detailed_results` field to the output
2. **streamlit_app.py** (lines 640-682)
   - Added detailed per-query analysis section
   - Enhanced download with full context
   - Improved UI with expandable query details
## Example Workflow
```
1. Create collection with specific embedding model and chunking strategy
2. Run TRACE evaluation on test samples
3. Review metrics in Streamlit UI
4. Click details to inspect specific queries
5. Download JSON for:
- Archival and reproducibility
- Further analysis in notebooks
- Sharing with stakeholders
- Comparing different configurations
```
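Step 5's comparison across configurations can be done directly from two downloaded exports, since each one carries its `evaluation_config`. A minimal sketch (the `compare_runs` helper and the file names are illustrative):

```python
import json

def compare_runs(path_a, path_b,
                 keys=("utilization", "relevance", "adherence",
                       "completeness", "average")):
    """Return the per-metric delta (run B minus run A) between two exports."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    return {k: round(b[k] - a[k], 4) for k in keys}
```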
## Integration with RAG Pipeline
The evaluation captures:
- **Queries**: From RAGBench test split
- **Retrieved Documents**: From vector store retrieval
- **LLM Response**: From Groq API
- **Ground Truth**: From RAGBench labels
- **Metrics**: Computed by TRACEEvaluator
- **Configuration**: Embedding model, chunking strategy, parameters
All information is preserved in the JSON export for complete traceability.