Verification Guide: LLM Audit Trail Feature
Quick Start: How to Test the Implementation
Step 1: Start the Application
```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```
Step 2: Run an Evaluation
- Select RAGBench dataset
- Choose GPT Labeling or Hybrid evaluation method
- Set a small sample count (1-3 for testing)
- Click "Start Evaluation"
- Wait for evaluation to complete
Step 3: Download Results
- Scroll to the "Download Results" section
- Click the "Download Complete Results (JSON)" button
- Save the file to your computer
Step 4: Inspect the JSON
Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
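If you prefer to spot-check the file programmatically instead of eyeballing it, a small helper like the one below can confirm the top-level keys first (the filename in the usage comment is just an example):

```python
import json

EXPECTED_TOP_LEVEL = ("evaluation_metadata", "aggregate_metrics", "detailed_results")

def check_top_level(data):
    """Return the expected top-level keys that are missing from the results dict."""
    return [key for key in EXPECTED_TOP_LEVEL if key not in data]

# Usage on a downloaded file (path is an example):
# with open("evaluation_results.json", encoding="utf-8") as f:
#     print(check_top_level(json.load(f)) or "top-level structure OK")
```

An empty return value means all three top-level sections are present.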
Verification Checklist
Code-Level Verification
```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py

# 2. Run the test script
python test_llm_audit_trail.py

# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
JSON Structure Verification
The downloaded JSON should contain:
- `evaluation_metadata` with timestamp, dataset, method, total_samples
- `aggregate_metrics` with main metrics
- `rmse_metrics` if available
- `auc_metrics` if available
- `detailed_results` array with multiple query results
- Each detailed result contains:
  - `query_id`: Integer starting from 1
  - `question`: The user's question
  - `llm_response`: The LLM's response
  - `retrieved_documents`: Array of context documents
  - `metrics`: Dictionary with metric scores
  - `ground_truth_scores`: Dictionary with ground truth values
  - `llm_request`: Dictionary containing:
    - `system_prompt`: System instruction (non-empty string)
    - `query`: User question (matches the `question` field)
    - `context_documents`: Array of documents (matches `retrieved_documents`)
    - `llm_response`: Original response (matches the `llm_response` field)
    - `labeling_prompt`: Generated prompt (non-empty string)
    - `model`: Model name (e.g., "groq-default")
    - `temperature`: Should be 0.0
    - `max_tokens`: Should be 2048
    - `full_llm_response`: Complete raw response (non-empty string)
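A sketch of how this checklist could be automated (field names follow the structure above; treat this as a starting point, not part of the shipped test suite):

```python
REQUIRED_FIELDS = (
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
)

def missing_fields(llm_request):
    """Names of required audit-trail fields that are absent or empty.

    temperature may legitimately be 0.0, so only None, "", and []
    count as empty here.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = llm_request.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing

def validate_results(data):
    """Map query_id -> list of missing fields for every incomplete result."""
    problems = {}
    for result in data.get("detailed_results", []):
        bad = missing_fields(result.get("llm_request", {}))
        if bad:
            problems[result.get("query_id")] = bad
    return problems
```

An empty dict from `validate_results` means every result carries a complete audit trail.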
Functional Verification
Test 1: Basic Functionality
```python
from advanced_rag_evaluator import AdvancedRAGEvaluator

evaluator = AdvancedRAGEvaluator(llm_client=client, ...)

test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}

# Should return a dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
Test 2: JSON Serialization
```python
import json

# Load the downloaded JSON and verify it is valid
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
Test 3: Backwards Compatibility
```python
# evaluate() accepts the same arguments as before...
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)

# ...but now returns a (scores, llm_info) tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards-compatible tuple unpacking works")
```
Expected Results
When you download the JSON and inspect it:
- **LLM Request Field Present**: Each query result contains a complete `llm_request` object
- **All 9 Fields Present**: All required fields (`system_prompt`, `query`, `context_documents`, `llm_response`, `labeling_prompt`, `model`, `temperature`, `max_tokens`, `full_llm_response`)
- **Data Consistency**: Values in `llm_request` match the corresponding fields in the query result
- **JSON Valid**: The file is valid JSON that can be parsed and inspected
- **Complete Audit Trail**: Full visibility into what was sent to the LLM and what it returned
What Each Field Represents
| Field | Value | Purpose |
|---|---|---|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to the LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to the LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for the LLM call |
| `full_llm_response` | Complete raw response | Exact response from the LLM before JSON parsing |
Common Issues and Solutions
Issue 1: `llm_request` field is empty or missing
- Cause: LLM client not available, or the call failed
- Solution: Ensure the Groq API key is configured and the network is available

Issue 2: Context documents empty in `llm_request`
- Cause: Documents were not retrieved properly
- Solution: Check that document retrieval is working in the evaluation pipeline

Issue 3: JSON file not downloading
- Cause: Large file size or a Streamlit issue
- Solution: Ensure the browser has sufficient memory and try refreshing the page

Issue 4: Unicode encoding errors
- Cause: Special characters in the LLM response
- Solution: Open the JSON with UTF-8 encoding
```bash
# Windows
notepad.exe evaluation_results.json
# Then: File > Save As > Encoding: UTF-8
```

Or use Python:

```python
import json

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
Running Tests
Automated Test Suite
```bash
# Run the comprehensive test
python test_llm_audit_trail.py

# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
Manual Testing Steps
Test with Single Query
- Run evaluation with 1 sample
- Download JSON
- Verify llm_request has all fields
Test with Multiple Queries
- Run evaluation with 5 samples
- Download JSON
- Verify each query has complete llm_request
Test Data Consistency
- Compare `llm_request.query` with the root `question` field
- Compare `llm_request.context_documents` with `retrieved_documents`
- Verify all strings are non-empty
Test File Size
- Check JSON file is reasonable size (typically 50-500KB for 10 queries)
- Verify file opens in text editor without issues
Success Criteria
- ✅ All items in the checklist above are verified
- ✅ Test script runs without errors
- ✅ JSON downloads successfully
- ✅ `llm_request` field present in all results
- ✅ All 9 required fields populated
- ✅ JSON is valid and well-formed
- ✅ File can be opened and inspected
- ✅ Data is consistent across results
Next Steps After Verification
- Review Audit Trail: Inspect the captured LLM interactions
- Validate Quality: Check if prompts and responses look correct
- Test Reproduction: Use the captured data to reproduce evaluations if needed
- Archive Results: Store JSON for compliance/auditing purposes
- Iterate: Use insights from audit trail to improve prompts if needed
Support
If you encounter issues:
- Check error messages in Streamlit console
- Review `LLM_AUDIT_TRAIL_CHANGES.md` for implementation details
- Run test_llm_audit_trail.py for automated diagnostics
- Check CODE_CHANGES_REFERENCE.md for code-level details