# Verification Guide: LLM Audit Trail Feature

## Quick Start: How to Test the Implementation

### Step 1: Start the Application

```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```

### Step 2: Run an Evaluation

1. Select the **RAGBench** dataset
2. Choose the **GPT Labeling** or **Hybrid** evaluation method
3. Set a small sample count (1-3 for testing)
4. Click "Start Evaluation"
5. Wait for the evaluation to complete

### Step 3: Download Results

1. Scroll to the "💾 Download Results" section
2. Click the "📥 Download Complete Results (JSON)" button
3. Save the file to your computer

### Step 4: Inspect the JSON

Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
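For a quick programmatic spot-check of this structure, you can print the key layout instead of reading the whole file. This is a minimal sketch; pass it the parsed contents of whatever file you saved in Step 3 (the `sample` below is a stand-in, not real output):

```python
def spot_check(data):
    """Print the key structure of a downloaded results file."""
    print("Top-level keys:", sorted(data))
    first = data["detailed_results"][0]
    print("First result keys:", sorted(first))
    print("llm_request fields:", sorted(first["llm_request"]))

# Minimal stand-in for a real downloaded file:
sample = {
    "evaluation_metadata": {},
    "aggregate_metrics": {},
    "detailed_results": [
        {"query_id": 1, "question": "...", "llm_request": {"model": "groq-default"}}
    ],
}
spot_check(sample)
```

With a real file, call it as `spot_check(json.load(open("evaluation_results.json", encoding="utf-8")))`.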
## Verification Checklist

### Code-Level Verification

```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py

# 2. Run the test script
python test_llm_audit_trail.py

# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
### JSON Structure Verification

The downloaded JSON should contain:

- [ ] `evaluation_metadata` with timestamp, dataset, method, total_samples
- [ ] `aggregate_metrics` with main metrics
- [ ] `rmse_metrics` if available
- [ ] `auc_metrics` if available
- [ ] `detailed_results` array with multiple query results
- [ ] Each detailed_result contains:
  - [ ] `query_id`: Integer starting from 1
  - [ ] `question`: The user's question
  - [ ] `llm_response`: The LLM's response
  - [ ] `retrieved_documents`: Array of context documents
  - [ ] `metrics`: Dictionary with metric scores
  - [ ] `ground_truth_scores`: Dictionary with ground truth values
  - [ ] `llm_request`: Dictionary containing:
    - [ ] `system_prompt`: System instruction (non-empty string)
    - [ ] `query`: User question (matches the `question` field)
    - [ ] `context_documents`: Array of documents (matches `retrieved_documents`)
    - [ ] `llm_response`: Original response (matches the top-level `llm_response` field)
    - [ ] `labeling_prompt`: Generated prompt (non-empty string)
    - [ ] `model`: Model name (e.g., "groq-default")
    - [ ] `temperature`: Should be 0.0
    - [ ] `max_tokens`: Should be 2048
    - [ ] `full_llm_response`: Complete raw response (non-empty string)
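The checklist above can be automated with a short validator. This is a sketch, not part of the shipped code: the field names mirror the checklist, and the empty-value handling is an assumption (0.0 is a legitimate temperature, so numeric zeros are not treated as "empty"):

```python
# The nine llm_request fields named in the checklist above.
REQUIRED_LLM_FIELDS = [
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
]

# Top-level fields each detailed_result entry should carry.
REQUIRED_RESULT_FIELDS = [
    "query_id", "question", "llm_response", "retrieved_documents",
    "metrics", "ground_truth_scores", "llm_request",
]

def validate_result(result):
    """Return a list of problems found in one detailed_results entry."""
    problems = []
    for field in REQUIRED_RESULT_FIELDS:
        if field not in result:
            problems.append(f"missing {field}")
    req = result.get("llm_request", {})
    for field in REQUIRED_LLM_FIELDS:
        value = req.get(field)
        if value is None or value == "" or value == []:
            problems.append(f"llm_request: empty or missing {field}")
    if req.get("temperature") != 0.0:
        problems.append("llm_request: temperature should be 0.0")
    if req.get("max_tokens") != 2048:
        problems.append("llm_request: max_tokens should be 2048")
    return problems

# A well-formed entry produces no problems:
sample = {
    "query_id": 1, "question": "What is AI?", "llm_response": "AI is...",
    "retrieved_documents": ["doc1"], "metrics": {}, "ground_truth_scores": {},
    "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "What is AI?", "context_documents": ["doc1"],
        "llm_response": "AI is...", "labeling_prompt": "Rate...",
        "model": "groq-default", "temperature": 0.0, "max_tokens": 2048,
        "full_llm_response": "{...}",
    },
}
print(validate_result(sample))  # → []
```

Run it over every entry in `detailed_results` and collect the returned problem lists.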
### Functional Verification

**Test 1: Basic Functionality**

```python
from advanced_rag_evaluator import AdvancedRAGEvaluator

evaluator = AdvancedRAGEvaluator(llm_client=client, ...)

test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}

# Should return a dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])

assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
**Test 2: JSON Serialization**

```python
import json

# Load the downloaded JSON and verify it parses
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
**Test 3: Backwards Compatibility**

```python
# evaluate() now returns a (scores, llm_info) tuple instead of bare scores
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)

# Callers unpack the tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards-compatible tuple unpacking works")
```
## Expected Results

When you download the JSON and inspect it:

1. **LLM Request Field Present**: Each query result contains a complete `llm_request` object
2. **All 9 Fields Present**: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
3. **Data Consistency**: Values in `llm_request` match corresponding fields in the query result
4. **JSON Valid**: File is valid JSON that can be parsed and inspected
5. **Complete Audit Trail**: Full visibility into what was sent to the LLM and what it returned
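Expected result 3 (data consistency) can be checked mechanically. This sketch compares each `llm_request` against its enclosing query result, using the field names from the checklist above; the `entry` dict is an illustrative stand-in:

```python
def check_consistency(result):
    """Raise AssertionError if llm_request disagrees with its query result."""
    req = result["llm_request"]
    assert req["query"] == result["question"], "query mismatch"
    assert req["context_documents"] == result["retrieved_documents"], "context mismatch"
    assert req["llm_response"] == result["llm_response"], "response mismatch"

# Minimal consistent example:
entry = {
    "question": "What is AI?",
    "retrieved_documents": ["doc1", "doc2"],
    "llm_response": "AI is...",
    "llm_request": {
        "query": "What is AI?",
        "context_documents": ["doc1", "doc2"],
        "llm_response": "AI is...",
    },
}
check_consistency(entry)  # passes silently
print("[PASS] llm_request is consistent with the query result")
```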
## What Each Field Represents

| Field | Example Value | Purpose |
|-------|---------------|---------|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to the LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to the LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for the LLM call |
| `full_llm_response` | Complete raw response | Exact response from the LLM before JSON parsing |
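For a quick scan of many results, the fields in the table above can be condensed into a one-line summary per query. A minimal sketch (the `req` dict below is illustrative, not real output):

```python
def summarize_request(req):
    """One-line audit summary built from the llm_request fields above."""
    return (
        f"model={req['model']} temperature={req['temperature']} "
        f"max_tokens={req['max_tokens']} "
        f"context_docs={len(req['context_documents'])} "
        f"prompt_chars={len(req['labeling_prompt'])} "
        f"response_chars={len(req['full_llm_response'])}"
    )

req = {
    "model": "groq-default", "temperature": 0.0, "max_tokens": 2048,
    "context_documents": ["doc1", "doc2"],
    "labeling_prompt": "Rate the answer...",
    "full_llm_response": '{"relevance": 0.9}',
}
print(summarize_request(req))
```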
## Common Issues and Solutions

### Issue 1: `llm_request` field is empty or missing

**Cause**: LLM client unavailable, or the call failed
**Solution**: Ensure the Groq API key is configured and the network is reachable

### Issue 2: Context documents empty in `llm_request`

**Cause**: Documents were not retrieved properly
**Solution**: Check that document retrieval is working in the evaluation pipeline

### Issue 3: JSON file not downloading

**Cause**: Large file size or a Streamlit issue
**Solution**: Ensure the browser has sufficient memory and try refreshing the page

### Issue 4: Unicode encoding errors

**Cause**: Special characters in the LLM response
**Solution**: Open the JSON with UTF-8 encoding
```bash
# Windows: open in Notepad, then File > Save As > Encoding: UTF-8
notepad.exe evaluation_results.json
```

Or read the file with an explicit encoding in Python:

```python
import json

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
## Running Tests

### Automated Test Suite

```bash
# Run the comprehensive test
python test_llm_audit_trail.py

# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
### Manual Testing Steps

1. **Test with a Single Query**
   - Run an evaluation with 1 sample
   - Download the JSON
   - Verify `llm_request` has all fields
2. **Test with Multiple Queries**
   - Run an evaluation with 5 samples
   - Download the JSON
   - Verify each query has a complete `llm_request`
3. **Test Data Consistency**
   - Compare `llm_request.query` with the root `question` field
   - Compare `llm_request.context_documents` with `retrieved_documents`
   - Verify all strings are non-empty
4. **Test File Size**
   - Check the JSON file is a reasonable size (typically 50-500 KB for 10 queries)
   - Verify the file opens in a text editor without issues
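Manual step 4 can be scripted. This sketch parses the results file and reports its size on disk; the 5 MB ceiling in `max_kb` is an arbitrary assumption, so adjust it to your environment:

```python
import json
import os

def check_file(path, max_kb=5000):
    """Parse a results file and report its size; flag anything oversized."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    n = len(data.get("detailed_results", []))
    print(f"{path}: {size_kb:.1f} KB, {n} query results")
    return size_kb <= max_kb
```

Typical use: `check_file("evaluation_results.json")` should print the size and return `True`.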
## Success Criteria

- ✅ All items in checklist above are verified
- ✅ Test script runs without errors
- ✅ JSON downloads successfully
- ✅ llm_request field present in all results
- ✅ All 9 required fields populated
- ✅ JSON is valid and well-formed
- ✅ File can be opened and inspected
- ✅ Data is consistent across results
## Next Steps After Verification

1. **Review Audit Trail**: Inspect the captured LLM interactions
2. **Validate Quality**: Check if prompts and responses look correct
3. **Test Reproduction**: Use the captured data to reproduce evaluations if needed
4. **Archive Results**: Store the JSON for compliance/auditing purposes
5. **Iterate**: Use insights from the audit trail to improve prompts if needed
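Step 3 (reproduction) amounts to replaying a captured request with the same parameters. The sketch below assumes a hypothetical `call_llm(...)` callable standing in for your actual Groq client; wire it to whatever your pipeline uses, keeping the captured parameter values:

```python
def reproduce_evaluation(req, call_llm):
    """Replay one captured llm_request through an LLM callable.

    `call_llm` is a hypothetical stand-in for the real client call; it
    should accept the parameters captured in the audit trail.
    """
    return call_llm(
        system_prompt=req["system_prompt"],
        prompt=req["labeling_prompt"],
        model=req["model"],
        temperature=req["temperature"],
        max_tokens=req["max_tokens"],
    )

# Demonstration with a fake client that just echoes its inputs:
def fake_llm(**kwargs):
    return f"called {kwargs['model']} at temperature {kwargs['temperature']}"

captured = {
    "system_prompt": "You are an expert RAG evaluator...",
    "labeling_prompt": "Rate the answer...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
}
print(reproduce_evaluation(captured, fake_llm))
```

Because temperature is 0.0, a replay against the same model should produce closely matching output, which is what makes the audit trail useful for reproduction.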
## Support

If you encounter issues:

1. Check error messages in the Streamlit console
2. Review LLMAUDITTRAIL_CHANGES.md for implementation details
3. Run test_llm_audit_trail.py for automated diagnostics
4. Check CODE_CHANGES_REFERENCE.md for code-level details