# Verification Guide: LLM Audit Trail Feature

## Quick Start: How to Test the Implementation

### Step 1: Start the Application

```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```

### Step 2: Run an Evaluation

1. Select **RAGBench** dataset
2. Choose **GPT Labeling** or **Hybrid** evaluation method
3. Set a small sample count (1-3 for testing)
4. Click "Start Evaluation"
5. Wait for the evaluation to complete

### Step 3: Download Results

1. Scroll to the "💾 Download Results" section
2. Click the "📥 Download Complete Results (JSON)" button
3. Save the file to your computer

### Step 4: Inspect the JSON

Open the downloaded JSON file in a text editor and verify:

```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```

## Verification Checklist

### Code-Level Verification

```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py

# 2. Run the test script
python test_llm_audit_trail.py

# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```

### JSON Structure Verification

The downloaded JSON should contain:

- [ ] `evaluation_metadata` with timestamp, dataset, method, total_samples
- [ ] `aggregate_metrics` with main metrics
- [ ] `rmse_metrics` if available
- [ ] `auc_metrics` if available
- [ ] `detailed_results` array with multiple query results
- [ ] Each `detailed_results` entry contains:
  - [ ] `query_id`: Integer starting from 1
  - [ ] `question`: The user's question
  - [ ] `llm_response`: The LLM's response
  - [ ] `retrieved_documents`: Array of context documents
  - [ ] `metrics`: Dictionary with metric scores
  - [ ] `ground_truth_scores`: Dictionary with ground truth values
  - [ ] `llm_request`: Dictionary containing:
    - [ ] `system_prompt`: System instruction (non-empty string)
    - [ ] `query`: User question (matches the `question` field)
    - [ ] `context_documents`: Array of documents (matches `retrieved_documents`)
    - [ ] `llm_response`: Original response (matches the top-level `llm_response` field)
    - [ ] `labeling_prompt`: Generated prompt (non-empty string)
    - [ ] `model`: Model name (e.g., "groq-default")
    - [ ] `temperature`: Should be 0.0
    - [ ] `max_tokens`: Should be 2048
    - [ ] `full_llm_response`: Complete raw response (non-empty string)

### Functional Verification

**Test 1: Basic Functionality**

```python
from advanced_rag_evaluator import AdvancedRAGEvaluator

evaluator = AdvancedRAGEvaluator(llm_client=client, ...)
test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}

# Should return a dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```

**Test 2: JSON Serialization**

```python
import json

# Download the JSON and verify it is valid
with open("evaluation_results.json", "r") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")

print("[PASS] JSON structure is valid and complete")
```

**Test 3: Backwards Compatibility**

```python
# Old code should still work
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)

# New code returns a tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards-compatible tuple unpacking works")
```

## Expected Results

When you download the JSON and inspect it:

1. **LLM Request Field Present**: Each query result contains a complete `llm_request` object
2. **All 9 Fields Present**: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
3. **Data Consistency**: Values in `llm_request` match the corresponding fields in the query result
4. **JSON Valid**: File is valid JSON that can be parsed and inspected
5.
**Complete Audit Trail**: Full visibility into what was sent to the LLM and what it returned

## What Each Field Represents

| Field | Value | Purpose |
|-------|-------|---------|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to the LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to the LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for the LLM call |
| `full_llm_response` | Complete raw response | Exact response from the LLM before JSON parsing |

## Common Issues and Solutions

### Issue 1: `llm_request` field is empty/missing

**Cause**: LLM client not available or the call failed
**Solution**: Ensure the Groq API key is configured and the network is available

### Issue 2: Context documents empty in `llm_request`

**Cause**: Documents not retrieved properly
**Solution**: Check that document retrieval is working in the evaluation pipeline

### Issue 3: JSON file not downloading

**Cause**: Large file size or a Streamlit issue
**Solution**: Ensure the browser has sufficient memory and try refreshing the page

### Issue 4: Unicode encoding errors

**Cause**: Special characters in the LLM response
**Solution**: Open the JSON with UTF-8 encoding

```bash
# Windows
notepad.exe evaluation_results.json
# Then: File > Save As > Encoding: UTF-8
```

```python
# Or use Python
import json
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```

## Running Tests

### Automated Test Suite

```bash
# Run the comprehensive test
python test_llm_audit_trail.py

# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```

### Manual Testing Steps

1. **Test with a Single Query**
   - Run an evaluation with 1 sample
   - Download the JSON
   - Verify `llm_request` has all fields
2. **Test with Multiple Queries**
   - Run an evaluation with 5 samples
   - Download the JSON
   - Verify each query has a complete `llm_request`
3. **Test Data Consistency**
   - Compare `llm_request.query` with the root `question` field
   - Compare `llm_request.context_documents` with `retrieved_documents`
   - Verify all strings are non-empty
4. **Test File Size**
   - Check the JSON file is a reasonable size (typically 50-500 KB for 10 queries)
   - Verify the file opens in a text editor without issues

## Success Criteria

✅ All items in the checklist above are verified
✅ Test script runs without errors
✅ JSON downloads successfully
✅ `llm_request` field present in all results
✅ All 9 required fields populated
✅ JSON is valid and well-formed
✅ File can be opened and inspected
✅ Data is consistent across results

## Next Steps After Verification

1. **Review Audit Trail**: Inspect the captured LLM interactions
2. **Validate Quality**: Check whether prompts and responses look correct
3. **Test Reproduction**: Use the captured data to reproduce evaluations if needed
4. **Archive Results**: Store the JSON for compliance/auditing purposes
5. **Iterate**: Use insights from the audit trail to improve prompts if needed

## Support

If you encounter issues:

1. Check error messages in the Streamlit console
2. Review LLMAUDITTRAIL_CHANGES.md for implementation details
3. Run test_llm_audit_trail.py for automated diagnostics
4. Check CODE_CHANGES_REFERENCE.md for code-level details
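## Scripting the Checklist (Sketch)

The structure and data-consistency checks described in this guide can be automated end-to-end. The sketch below assumes the JSON layout shown in Step 4; `validate_audit_trail`, `validate_file`, and the field list are illustrative names, not part of the project code.

```python
import json

# The 9 required llm_request fields from the verification checklist
REQUIRED_LLM_REQUEST_FIELDS = [
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
]

def validate_audit_trail(data):
    """Return a list of problems; an empty list means the checks pass."""
    if "detailed_results" not in data:
        return ["missing top-level 'detailed_results'"]
    problems = []
    for i, result in enumerate(data["detailed_results"]):
        req = result.get("llm_request")
        if not isinstance(req, dict):
            problems.append(f"result {i}: 'llm_request' missing or not a dict")
            continue
        # All 9 fields must be present
        for field in REQUIRED_LLM_REQUEST_FIELDS:
            if field not in req:
                problems.append(f"result {i}: llm_request missing '{field}'")
        # Data consistency with the surrounding query result
        if req.get("query") != result.get("question"):
            problems.append(f"result {i}: llm_request.query != question")
        if req.get("context_documents") != result.get("retrieved_documents"):
            problems.append(f"result {i}: context_documents mismatch")
    return problems

def validate_file(path):
    """Load a downloaded results file (UTF-8) and validate it."""
    with open(path, "r", encoding="utf-8") as f:
        return validate_audit_trail(json.load(f))
```

For a healthy download, `validate_file("evaluation_results.json")` should return an empty list.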
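## Reviewing a Single Captured Interaction (Sketch)

For the "Review Audit Trail" step above, a small helper can pretty-print one captured interaction from the downloaded file. The function name and file path here are illustrative assumptions.

```python
import json

def show_llm_request(path, query_id=1):
    """Return the llm_request for one query, pretty-printed for review."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    for result in data["detailed_results"]:
        if result.get("query_id") == query_id:
            return json.dumps(result["llm_request"], indent=2, ensure_ascii=False)
    raise ValueError(f"no result with query_id={query_id}")
```

Usage: `print(show_llm_request("evaluation_results.json", query_id=1))`.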
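## File Size and Encoding Sanity Check (Sketch)

Manual testing step 4 (file size) and Issue 4 (UTF-8 encoding) can be checked in one pass. The size bounds below are rough guides taken from this document's "typically 50-500 KB for 10 queries" estimate, not hard limits.

```python
import json
import os

def sanity_check(path, min_kb=1, max_kb=5000):
    """Report file size and confirm the file decodes as UTF-8 JSON."""
    size_kb = os.path.getsize(path) / 1024
    if not (min_kb <= size_kb <= max_kb):
        print(f"[WARN] unusual file size: {size_kb:.1f} KB")
    # Raises UnicodeDecodeError on bad encoding, ValueError on invalid JSON
    with open(path, "r", encoding="utf-8") as f:
        json.load(f)
    return size_kb
```

If this raises `UnicodeDecodeError`, re-save the file as UTF-8 as described under Issue 4.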