# Implementation Complete: LLM Audit Trail in JSON Downloads
## Summary
Successfully enhanced the RAG evaluation system to include complete LLM request and response information in JSON downloads. This provides full auditability and transparency of all LLM interactions during evaluation.
## What Was Implemented
### Complete LLM Audit Trail Captured
When users download evaluation results as JSON, each query now includes:
```json
{
  "query_id": 1,
  "question": "What is artificial intelligence?",
  "llm_response": "AI is the simulation of human intelligence...",
  "retrieved_documents": [...],
  "metrics": {...},
  "ground_truth_scores": {...},
  "llm_request": {
    "system_prompt": "You are an expert RAG evaluator...",
    "query": "What is artificial intelligence?",
    "context_documents": [...],
    "llm_response": "AI is the simulation of human intelligence...",
    "labeling_prompt": "Evaluate relevance of documents...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
    "full_llm_response": "Complete raw response from LLM"
  }
}
```
## Files Modified
| File | Changes | Lines |
|------|---------|-------|
| `advanced_rag_evaluator.py` | Modified `_get_gpt_labels()` to capture complete LLM interaction; Updated `evaluate()` to return tuple with audit trail; Updated `evaluate_batch()` to store audit trail in results | 483-701 |
| `evaluation_pipeline.py` | Updated to handle new tuple return from `advanced_evaluator.evaluate()` | 80-127 |
| `streamlit_app.py` | No changes needed - automatically includes LLM audit trail in downloads | - |
## Key Implementation Details
### 1. LLM Request Capture (_get_gpt_labels)
- Captures system prompt used for labeling
- Records user query
- Stores retrieved context documents list
- Saves full raw LLM response before JSON parsing
- Returns both parsed labels and complete audit trail
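The captured fields correspond one-to-one with the `llm_request` schema shown above. A minimal sketch of the capture step (hypothetical helper name; the real logic lives inside `_get_gpt_labels` in `advanced_rag_evaluator.py`):

```python
def build_llm_request_info(system_prompt, query, context_documents,
                           llm_response, labeling_prompt, raw_response,
                           model="groq-default", temperature=0.0, max_tokens=2048):
    """Assemble the nine audit fields stored alongside the parsed labels.

    Hypothetical helper; field names match the JSON download schema.
    """
    return {
        "system_prompt": system_prompt,
        "query": query,
        "context_documents": list(context_documents),
        "llm_response": llm_response,
        "labeling_prompt": labeling_prompt,
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        # Raw text is kept even if JSON parsing of the labels later fails.
        "full_llm_response": raw_response,
    }

info = build_llm_request_info(
    "You are an expert RAG evaluator...",
    "What is artificial intelligence?",
    ["Doc 1", "Doc 2"],
    "AI is the simulation of human intelligence...",
    "Evaluate relevance of documents...",
    '{"relevance": [1, 0]}',
)
print(len(info))  # 9
```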
### 2. Score and Audit Trail Propagation (evaluate)
- Returns tuple: `(AdvancedTRACEScores, llm_request_info)`
- Maintains backward compatibility
- Gracefully handles missing LLM client
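The tuple return and the missing-client fallback can be sketched as follows (stand-in score dict and hypothetical `llm_client.label` call; the real method returns an `AdvancedTRACEScores` object):

```python
def evaluate(query, response, documents, llm_client=None):
    """Sketch: return (scores, llm_request_info); empty dict when no client."""
    scores = {"context_relevance": 0.85}  # stand-in for AdvancedTRACEScores
    if llm_client is None:
        return scores, {}  # graceful degradation: valid JSON, empty audit trail
    # Hypothetical labeling call returning parsed labels plus the audit dict.
    labels, llm_request_info = llm_client.label(query, documents)
    return scores, llm_request_info

scores, audit = evaluate("What is AI?", "AI is...", ["Doc 1"])
print(audit)  # {}
```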
### 3. Batch Storage (evaluate_batch)
- Unpacks tuple from evaluate()
- Stores `llm_request` in each `detailed_result`
- Preserves all other metrics and ground truth data
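In `evaluate_batch()`, the unpacking and storage step might look like this sketch (the `evaluate` stub stands in for `advanced_evaluator.evaluate()`; result keys match the JSON download schema):

```python
def evaluate(query, response, documents):
    """Stub for advanced_evaluator.evaluate(); returns (scores, audit dict)."""
    return {"context_relevance": 0.85}, {"query": query, "model": "groq-default"}

def evaluate_batch(samples):
    """Unpack the (scores, llm_request_info) tuple per sample and attach
    the audit trail to each detailed result."""
    detailed_results = []
    for i, sample in enumerate(samples, start=1):
        scores, llm_request_info = evaluate(
            sample["question"], sample["llm_response"], sample["retrieved_documents"]
        )
        detailed_results.append({
            "query_id": i,
            "question": sample["question"],
            "metrics": scores,
            "llm_request": llm_request_info,  # audit trail travels with the result
        })
    return detailed_results

results = evaluate_batch([{
    "question": "What is AI?",
    "llm_response": "AI is...",
    "retrieved_documents": ["Doc 1"],
}])
print(results[0]["llm_request"]["model"])  # groq-default
```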
### 4. JSON Export (streamlit_app)
- No changes needed
- Automatically includes audit trail
- JSON download already uses `detailed_results`
## Data Fields Included
Each `llm_request` object contains:
| Field | Purpose | Example |
|-------|---------|---------|
| `system_prompt` | System instruction for labeling | "You are an expert RAG evaluator..." |
| `query` | User's question | "What is AI?" |
| `context_documents` | Retrieved documents (list) | ["Doc 1", "Doc 2", ...] |
| `llm_response` | Original LLM response | "AI is..." |
| `labeling_prompt` | Generated labeling prompt | "Evaluate relevance..." |
| `model` | LLM model used | "groq-default" |
| `temperature` | Temperature setting | 0.0 |
| `max_tokens` | Token limit | 2048 |
| `full_llm_response` | Complete raw response | Full response text |
## Benefits
1. **Complete Auditability**: Full visibility into LLM interactions
2. **Debugging**: Reproduce evaluations if needed
3. **Transparency**: Complete record of what was evaluated
4. **Analysis**: Correlate inputs with output quality
5. **Compliance**: Meets regulatory auditability requirements
## Testing
All tests pass:
- ✅ LLM request info captures 9 required fields
- ✅ Tuple unpacking in evaluate() works correctly
- ✅ Audit trail is stored in detailed_results
- ✅ JSON serialization is valid
- ✅ Backwards compatible with evaluation pipeline
## JSON Download Example
When a user runs an evaluation and downloads "Complete Results (JSON)", the file contains:
```json
{
  "evaluation_metadata": {
    "timestamp": "2024-01-15T10:30:00.000000",
    "dataset": "ragbench",
    "method": "gpt_labeling",
    "total_samples": 10
  },
  "aggregate_metrics": {
    "context_relevance": 0.85,
    "context_utilization": 0.72,
    "completeness": 0.78,
    "adherence": 0.90,
    "average": 0.81
  },
  "rmse_metrics": {...},
  "auc_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_response": "...",
      "retrieved_documents": [...],
      "metrics": {...},
      "ground_truth_scores": {...},
      "llm_request": {
        "system_prompt": "...",
        "query": "...",
        "context_documents": [...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "...",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    },
    ...
  ]
}
```
## How to Use
1. Run RAG evaluation in Streamlit UI
2. Select "GPT Labeling" or "Hybrid" evaluation method
3. Wait for evaluation to complete
4. Click "Download Complete Results (JSON)" button
5. Inspect downloaded JSON for complete `llm_request` field in each result
## Error Handling
- If LLM client unavailable: Returns empty `llm_request_info` dict (graceful degradation)
- If LLM returns empty response: Stores empty string in `full_llm_response`
- If JSON parsing fails: Still includes raw `full_llm_response` for debugging
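The parse-failure path can be sketched as follows (hypothetical helper; the key point is that `full_llm_response` is populated before parsing is attempted, so the raw text survives any failure):

```python
import json

def parse_labels(raw_response):
    """Sketch of the degradation path: return parsed labels when possible,
    with the raw text preserved in the audit dict either way."""
    info = {"full_llm_response": raw_response or ""}  # empty string for empty response
    try:
        labels = json.loads(raw_response) if raw_response else {}
    except json.JSONDecodeError:
        labels = {}  # parsing failed; raw text remains in info for debugging
    return labels, info

labels, info = parse_labels("not valid json")
print(labels, info["full_llm_response"])  # {} not valid json
```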
## Backwards Compatibility
- Old evaluation code continues to work (pipeline checks for tuple vs single return)
- JSON structure remains valid even with empty audit trail
- No breaking changes to existing APIs
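The tuple-vs-single check in `evaluation_pipeline.py` might be sketched like this (hypothetical function name; only the shape of the check is taken from the description above):

```python
def unpack_evaluate_result(result):
    """Accept either the new (scores, llm_request_info) tuple or a bare
    scores object from a legacy evaluator."""
    if isinstance(result, tuple) and len(result) == 2:
        scores, llm_request_info = result
    else:
        scores, llm_request_info = result, {}  # legacy single-value return
    return scores, llm_request_info

# New-style tuple and legacy bare scores both unpack cleanly:
print(unpack_evaluate_result(({"average": 0.81}, {"model": "groq-default"})))
print(unpack_evaluate_result({"average": 0.81}))
```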
## Code Quality
- No syntax errors
- All type hints properly updated
- Consistent with existing code style
- Comprehensive error handling
## Next Steps (Optional Enhancements)
1. Add audit trail to TRACE evaluation method
2. Create visualization tools for LLM interactions
3. Add filtering/search in downloaded audit trail
4. Create audit trail report generator
## Conclusion
The implementation successfully adds complete LLM audit trail to JSON downloads, providing full transparency and auditability of all LLM interactions during RAG evaluation. This meets the requirement to "add the complete LLMS response the LLM request used including the system prompt, query and context" in the JSON download.