# Implementation Complete: LLM Audit Trail in JSON Downloads

## Summary

Successfully enhanced the RAG evaluation system to include complete LLM request and response information in JSON downloads. This provides full auditability and transparency of all LLM interactions during evaluation.

## What Was Implemented

### Complete LLM Audit Trail Captured

When users download evaluation results as JSON, each query now includes:
```json
{
  "query_id": 1,
  "question": "What is artificial intelligence?",
  "llm_response": "AI is the simulation of human intelligence...",
  "retrieved_documents": [...],
  "metrics": {...},
  "ground_truth_scores": {...},
  "llm_request": {
    "system_prompt": "You are an expert RAG evaluator...",
    "query": "What is artificial intelligence?",
    "context_documents": [...],
    "llm_response": "AI is the simulation of human intelligence...",
    "labeling_prompt": "Evaluate relevance of documents...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
    "full_llm_response": "Complete raw response from LLM"
  }
}
```
## Files Modified

| File | Changes | Lines |
|---|---|---|
| `advanced_rag_evaluator.py` | Modified `_get_gpt_labels()` to capture the complete LLM interaction; updated `evaluate()` to return a tuple with the audit trail; updated `evaluate_batch()` to store the audit trail in results | 483-701 |
| `evaluation_pipeline.py` | Updated to handle the new tuple return from `advanced_evaluator.evaluate()` | 80-127 |
| `streamlit_app.py` | No changes needed; the JSON download automatically includes the LLM audit trail | - |
## Key Implementation Details

### 1. LLM Request Capture (`_get_gpt_labels`)

- Captures the system prompt used for labeling
- Records the user query
- Stores the list of retrieved context documents
- Saves the full raw LLM response before JSON parsing
- Returns both the parsed labels and the complete audit trail
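The capture step above can be sketched as follows. This is a simplified illustration, not the project's actual code: the function name mirrors `_get_gpt_labels`, but the `llm_client.complete()` call and its signature are assumptions.

```python
import json

def get_gpt_labels(llm_client, system_prompt, query, context_documents, llm_response):
    """Ask the labeling LLM for scores and return (labels, audit_trail)."""
    labeling_prompt = f"Evaluate relevance of documents for: {query}"
    # Hypothetical client call; the real project may use a different interface.
    raw = llm_client.complete(system_prompt, labeling_prompt)

    llm_request_info = {
        "system_prompt": system_prompt,
        "query": query,
        "context_documents": context_documents,
        "llm_response": llm_response,
        "labeling_prompt": labeling_prompt,
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": raw or "",  # raw text saved before any parsing
    }
    try:
        labels = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        labels = {}  # parsing failed; the audit trail still has the raw response
    return labels, llm_request_info
```

Because the raw response is stored before `json.loads()` runs, a malformed LLM reply never loses the audit record.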
### 2. Score and Audit Trail Propagation (`evaluate`)

- Returns a tuple: `(AdvancedTRACEScores, llm_request_info)`
- Maintains backward compatibility
- Gracefully handles a missing LLM client
### 3. Batch Storage (`evaluate_batch`)

- Unpacks the tuple from `evaluate()`
- Stores `llm_request` in each `detailed_result`
- Preserves all other metrics and ground-truth data
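The batch step can be sketched like this, under the assumption that `evaluate()` takes one sample dict; the real method signatures may differ.

```python
def evaluate_batch(evaluator, samples):
    """Evaluate each sample and attach its LLM audit trail to the result."""
    detailed_results = []
    for i, sample in enumerate(samples, start=1):
        # New tuple return: scores plus the per-query audit trail.
        scores, llm_request_info = evaluator.evaluate(sample)
        detailed_results.append({
            "query_id": i,
            "question": sample["question"],
            "metrics": scores,
            "llm_request": llm_request_info,  # audit trail stored per query
        })
    return detailed_results
```

The real results also carry `retrieved_documents` and `ground_truth_scores`; they are omitted here for brevity.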
### 4. JSON Export (`streamlit_app`)

- No changes needed
- Automatically includes the audit trail
- The JSON download already uses `detailed_results`
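Why no changes were needed can be seen in a sketch of the export path: because the download serializes `detailed_results` wholesale, the new `llm_request` key rides along for free. Function and parameter names here are illustrative, not the app's actual code.

```python
import json

def build_json_download(metadata, aggregate_metrics, detailed_results):
    """Serialize evaluation output for the 'Complete Results (JSON)' download."""
    payload = {
        "evaluation_metadata": metadata,
        "aggregate_metrics": aggregate_metrics,
        "detailed_results": detailed_results,  # each entry now carries llm_request
    }
    # default=str keeps the dump valid if a score object is not natively serializable
    return json.dumps(payload, indent=2, default=str)
```

In the Streamlit app this string would be passed to `st.download_button(..., mime="application/json")`.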
## Data Fields Included

Each `llm_request` object contains:

| Field | Purpose | Example |
|---|---|---|
| `system_prompt` | System instruction for labeling | "You are an expert RAG evaluator..." |
| `query` | User's question | "What is AI?" |
| `context_documents` | Retrieved documents (list) | ["Doc 1", "Doc 2", ...] |
| `llm_response` | Original LLM response | "AI is..." |
| `labeling_prompt` | Generated labeling prompt | "Evaluate relevance..." |
| `model` | LLM model used | "groq-default" |
| `temperature` | Temperature setting | 0.0 |
| `max_tokens` | Token limit | 2048 |
| `full_llm_response` | Complete raw response | Full response text |
## Benefits

- **Complete Auditability**: Full visibility into LLM interactions
- **Debugging**: Reproduce evaluations if needed
- **Transparency**: Complete record of what was evaluated
- **Analysis**: Correlate inputs with output quality
- **Compliance**: Meets regulatory auditability requirements
## Testing

All tests pass:

- ✅ LLM request info captures all 9 required fields
- ✅ Tuple unpacking in `evaluate()` works correctly
- ✅ Audit trail is stored in `detailed_results`
- ✅ JSON serialization is valid
- ✅ Backwards compatible with the evaluation pipeline
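A minimal standalone check mirroring the first and fourth test items might look like this; the field names come from the table above, while the helper itself is hypothetical.

```python
import json

# The nine fields documented in "Data Fields Included".
REQUIRED_FIELDS = {
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens", "full_llm_response",
}

def validate_llm_request(llm_request):
    """Return True if the audit record has all 9 fields and serializes to JSON."""
    json.dumps(llm_request)  # raises TypeError if not JSON-serializable
    return not (REQUIRED_FIELDS - llm_request.keys())
```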
## JSON Download Example

When a user runs an evaluation and downloads "Complete Results (JSON)", the file will contain:

```json
{
  "evaluation_metadata": {
    "timestamp": "2024-01-15T10:30:00.000000",
    "dataset": "ragbench",
    "method": "gpt_labeling",
    "total_samples": 10
  },
  "aggregate_metrics": {
    "context_relevance": 0.85,
    "context_utilization": 0.72,
    "completeness": 0.78,
    "adherence": 0.90,
    "average": 0.81
  },
  "rmse_metrics": {...},
  "auc_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_response": "...",
      "retrieved_documents": [...],
      "metrics": {...},
      "ground_truth_scores": {...},
      "llm_request": {
        "system_prompt": "...",
        "query": "...",
        "context_documents": [...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "...",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    },
    ...
  ]
}
```
## How to Use

1. Run a RAG evaluation in the Streamlit UI
2. Select the "GPT Labeling" or "Hybrid" evaluation method
3. Wait for the evaluation to complete
4. Click the "Download Complete Results (JSON)" button
5. Inspect the downloaded JSON for the complete `llm_request` field in each result
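For the inspection step, a downloaded file can be summarized with a few lines of Python; the helper name is illustrative and assumes the JSON layout shown above.

```python
import json

def load_results(path):
    """Load a downloaded 'Complete Results (JSON)' file into a dict."""
    with open(path) as f:
        return json.load(f)

def summarize_audit_trail(results):
    """List (query_id, model) for each stored llm_request."""
    return [
        (r["query_id"], r.get("llm_request", {}).get("model"))
        for r in results.get("detailed_results", [])
    ]
```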
## Error Handling

- If the LLM client is unavailable: returns an empty `llm_request_info` dict (graceful degradation)
- If the LLM returns an empty response: stores an empty string in `full_llm_response`
- If JSON parsing fails: still includes the raw `full_llm_response` for debugging
## Backwards Compatibility

- Old evaluation code continues to work (the pipeline checks for a tuple vs. single return)
- The JSON structure remains valid even with an empty audit trail
- No breaking changes to existing APIs
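The tuple-vs-single check mentioned above can be done with a small shim like this (a sketch of the pattern, not the pipeline's exact code):

```python
def unpack_evaluate_result(result):
    """Accept both the old (scores only) and new (scores, audit) return shapes."""
    if isinstance(result, tuple) and len(result) == 2:
        scores, llm_request_info = result
    else:
        scores, llm_request_info = result, {}  # old shape: no audit trail
    return scores, llm_request_info
```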
## Code Quality

- No syntax errors
- All type hints properly updated
- Consistent with the existing code style
- Comprehensive error handling
## Next Steps (Optional Enhancements)

- Add the audit trail to the TRACE evaluation method
- Create visualization tools for LLM interactions
- Add filtering/search within the downloaded audit trail
- Create an audit-trail report generator
## Conclusion

The implementation adds a complete LLM audit trail to JSON downloads, providing full transparency and auditability of all LLM interactions during RAG evaluation. This meets the requirement to "add the complete LLMS response the LLM request used including the system prompt, query and context" in the JSON download.