# LLM Audit Trail Enhancement - Changes Summary

## Overview

Enhanced the JSON download functionality to include complete LLM request and response information for debugging and auditing purposes.
## Files Modified

### 1. advanced_rag_evaluator.py
#### Change 1: `_get_gpt_labels()` method (Lines 483-549)

**What Changed:**
- Modified to capture and return complete LLM request/response information
- Changed return type from `Optional[GPTLabelingOutput]` to `Optional[Dict]`
- The dictionary structure contains (see the sketch after this list):
  - `"labels"`: the `GPTLabelingOutput` object (the original return value)
  - `"llm_request_info"`: complete audit trail with:
    - `system_prompt`: the system instruction used
    - `query`: the user question
    - `context_documents`: list of retrieved documents
    - `llm_response`: the original LLM response
    - `labeling_prompt`: the generated labeling prompt
    - `model`: model name (e.g., `"groq-default"`)
    - `temperature`: temperature parameter (0.0)
    - `max_tokens`: max tokens setting (2048)
    - `full_llm_response`: complete raw response from the LLM

**Why:** Enables complete auditability of LLM interactions
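A minimal sketch of the new return shape. The labeling internals are elided, and the helper names (`llm_client`, `complete`, `_build_labeling_prompt`, `_parse_labels`, `system_prompt`) are illustrative assumptions, not the confirmed implementation:

```python
from typing import Dict, List, Optional

class AdvancedRAGEvaluator:
    # ... initialization and scoring logic elided ...

    def _get_gpt_labels(self, query: str, response: str,
                        documents: List[str]) -> Optional[Dict]:
        """Label one query/response pair and capture the full audit trail."""
        if self.llm_client is None:  # assumed attribute name
            return None

        labeling_prompt = self._build_labeling_prompt(query, response, documents)
        full_llm_response = self.llm_client.complete(  # assumed client API
            prompt=labeling_prompt, temperature=0.0, max_tokens=2048
        )
        labels = self._parse_labels(full_llm_response)  # -> GPTLabelingOutput

        return {
            "labels": labels,
            "llm_request_info": {
                "system_prompt": self.system_prompt,
                "query": query,
                "context_documents": documents,
                "llm_response": response,
                "labeling_prompt": labeling_prompt,
                "model": "groq-default",
                "temperature": 0.0,
                "max_tokens": 2048,
                "full_llm_response": full_llm_response,
            },
        }
```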
#### Change 2: `evaluate()` method (Lines 414-473)

**What Changed:**
- Changed return type from `AdvancedTRACEScores` to `Tuple[AdvancedTRACEScores, Optional[Dict]]`
- Now returns a tuple: `(scores, llm_request_info)` (sketched below)
- Handles both the old object return and the new dict return from `_get_gpt_labels()`
- Passes `llm_request_info` through to the caller

**Why:** Allows scores and LLM audit info to flow through the evaluation pipeline
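A sketch of the tuple-returning shape, with `_compute_scores` standing in for the elided scoring logic (an assumed helper name):

```python
from typing import Optional, Tuple

class AdvancedRAGEvaluator:
    # ... continued from the sketch above ...

    def evaluate(self, query: str, response: str,
                 documents: list) -> Tuple["AdvancedTRACEScores", Optional[dict]]:
        """Score one query/response pair and surface the LLM audit trail."""
        llm_request_info = None
        gpt_result = self._get_gpt_labels(query, response, documents)

        # New style: dict with "labels" and "llm_request_info";
        # old style: a bare GPTLabelingOutput (or None).
        if isinstance(gpt_result, dict):
            labels = gpt_result.get("labels")
            llm_request_info = gpt_result.get("llm_request_info")
        else:
            labels = gpt_result

        scores = self._compute_scores(query, response, documents, labels)
        return scores, llm_request_info
```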
#### Change 3: `evaluate_batch()` method (Lines 627-701)

**What Changed:**
- Updated to unpack the tuple from `evaluate()`: `scores, llm_request_info = self.evaluate(...)`
- Added an `"llm_request"` field to each `result_dict` in `detailed_results`
- Structure: each detailed result now includes:

```json
{
  "query_id": 1,
  "question": "...",
  "llm_response": "...",
  "retrieved_documents": [...],
  "metrics": {...},
  "ground_truth_scores": {...},
  "llm_request": {
    "system_prompt": "...",
    "query": "...",
    "context_documents": [...],
    "llm_response": "...",
    "full_llm_response": "...",
    "model": "...",
    "temperature": 0.0,
    "max_tokens": 2048
  }
}
```

**Why:** Stores the complete LLM audit trail for each query in batch evaluation (see the loop sketch below)
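A sketch of the batch loop. The input record keys mirror the structure above; `to_dict()` and the omitted aggregate fields are assumptions:

```python
class AdvancedRAGEvaluator:
    # ... continued; aggregation and metric logic elided ...

    def evaluate_batch(self, records: list) -> dict:
        """Evaluate a batch of queries, keeping a per-query audit trail."""
        detailed_results = []
        for i, rec in enumerate(records, start=1):
            scores, llm_request_info = self.evaluate(
                rec["question"], rec["llm_response"], rec["retrieved_documents"]
            )
            detailed_results.append({
                "query_id": i,
                "question": rec["question"],
                "llm_response": rec["llm_response"],
                "retrieved_documents": rec["retrieved_documents"],
                "metrics": scores.to_dict(),  # assumed serialization helper
                "llm_request": llm_request_info or {},
            })
        return {"detailed_results": detailed_results}
```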
### 2. evaluation_pipeline.py

#### Change: `evaluate()` method (Lines 80-127)

**What Changed:**
- Updated to handle the new tuple return from `advanced_evaluator.evaluate()`
- Added type checking: `if isinstance(result, tuple)` (see the sketch after this list)
- Backwards compatible with both old and new return types
- Extracts `scores` and `llm_info` separately
- Adds `"llm_request_info"` to the returned evaluation result dictionary

**Why:** Maintains compatibility while propagating LLM audit info through the pipeline
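The compatibility check, sketched in isolation (the surrounding pipeline code and the `evaluation_result` dict are illustrative):

```python
result = self.advanced_evaluator.evaluate(query, response, documents)

# New-style evaluators return (scores, llm_request_info);
# old-style evaluators return a bare scores object.
if isinstance(result, tuple):
    scores, llm_info = result
else:
    scores, llm_info = result, None

evaluation_result["scores"] = scores
evaluation_result["llm_request_info"] = llm_info or {}  # empty dict when unavailable
```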
### 3. streamlit_app.py

**No changes needed:**
- Already uses `detailed_results` in the JSON download (line 818)
- The new `llm_request` field is automatically included in the JSON export
- The JSON download structure already supports arbitrary nested fields (see the sketch below)
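Why no change is required: serializing the whole results dict carries nested fields along for free. A sketch of that download path (the session key, label, and filename are illustrative, not the app's actual strings):

```python
import json
import streamlit as st

results = st.session_state["evaluation_results"]  # assumed session key

# json.dumps serializes the entire results dict, so any nested field,
# including the new "llm_request", is exported without extra handling.
st.download_button(
    label="Download Complete Results (JSON)",
    data=json.dumps(results, indent=2, default=str),
    file_name="evaluation_results.json",
    mime="application/json",
)
```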
## Data Flow

```
_get_gpt_labels()
├─ Captures complete LLM interaction
└─ Returns: {"labels": GPTLabelingOutput, "llm_request_info": {...}}
        ↓
evaluate()
├─ Unpacks dict from _get_gpt_labels()
└─ Returns: (AdvancedTRACEScores, llm_request_info)
        ↓
evaluate_batch()
├─ Unpacks scores and llm_request_info
├─ Stores llm_request in detailed_results[i]["llm_request"]
└─ Returns: dict with detailed_results containing llm_request
        ↓
JSON Download (streamlit_app.py)
├─ detailed_results automatically includes llm_request
└─ Downloads complete audit trail for each query
```
## JSON Download Structure

After downloading, the JSON file contains:

```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "rmse_metrics": {...},
  "auc_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "User's question",
      "llm_response": "LLM's answer",
      "retrieved_documents": ["doc1", "doc2", ...],
      "metrics": {...},
      "ground_truth_scores": {...},
      "llm_request": {
        "system_prompt": "System instruction for labeling",
        "query": "User question",
        "context_documents": [...],
        "llm_response": "Raw LLM response before JSON parsing",
        "labeling_prompt": "Generated prompt sent to the LLM",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "Complete raw response from the LLM"
      }
    },
    ...
  ]
}
```
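To spot-check a downloaded file, a small inspection snippet (`evaluation_results.json` is an assumed filename for the saved download):

```python
import json

with open("evaluation_results.json") as f:
    results = json.load(f)

# One-line audit summary per query.
for item in results["detailed_results"]:
    audit = item.get("llm_request", {})
    print(item["query_id"], audit.get("model"),
          "prompt chars:", len(audit.get("labeling_prompt", "")))
```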
## Benefits

- **Complete Audit Trail:** Full visibility into what was sent to the LLM and what was returned
- **Debugging:** Evaluations can be reproduced when needed
- **Transparency:** Complete record of LLM interactions
- **Analysis:** LLM inputs can be correlated with output quality
- **Compliance:** Meets auditability requirements
## Testing Recommendations

- Run an evaluation with the GPT Labeling method
- Download the complete results JSON
- Verify each detailed result contains the `llm_request` field
- Verify `llm_request` contains:
  - `system_prompt`
  - `query` (the user question)
  - `context_documents` list
  - `labeling_prompt`
  - `full_llm_response`

These checks can be automated, as sketched below.
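A minimal pytest-style sketch of the field checks, assuming the download was saved as `evaluation_results.json`:

```python
import json

REQUIRED_AUDIT_FIELDS = {
    "system_prompt", "query", "context_documents",
    "labeling_prompt", "full_llm_response",
}

def test_llm_request_fields():
    with open("evaluation_results.json") as f:  # assumed filename
        results = json.load(f)
    for item in results["detailed_results"]:
        audit = item.get("llm_request", {})
        missing = REQUIRED_AUDIT_FIELDS - set(audit)
        assert not missing, f"query {item['query_id']} missing fields: {missing}"
```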
## Backwards Compatibility

- Changes are backwards compatible
- evaluation_pipeline.py checks `isinstance(result, tuple)` to handle both old and new return types
- If the LLM client is unavailable, an empty `llm_request_info` dict is returned gracefully
- The JSON download structure remains valid even if `llm_request` is empty
## Implementation Status

- ✅ `_get_gpt_labels()` captures complete LLM interaction
- ✅ `evaluate()` returns scores + LLM info
- ✅ `evaluate_batch()` stores LLM info in `detailed_results`
- ✅ JSON download automatically includes `llm_request`
- ✅ All files compile without errors
- ✅ Backwards compatible