
Verification Guide: LLM Audit Trail Feature

Quick Start: How to Test the Implementation

Step 1: Start the Application

cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py

Step 2: Run an Evaluation

  1. Select RAGBench dataset
  2. Choose GPT Labeling or Hybrid evaluation method
  3. Set a small sample count (1-3 for testing)
  4. Click "Start Evaluation"
  5. Wait for evaluation to complete

Step 3: Download Results

  1. Scroll to the "💾 Download Results" section
  2. Click the "📥 Download Complete Results (JSON)" button
  3. Save the file to your computer

Step 4: Inspect the JSON

Open the downloaded JSON file with a text editor and verify:

{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
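
A quick way to confirm the shape is to parse the file and list the keys of the first llm_request entry. The snippet below is a self-contained sketch: the embedded sample mirrors the schema above with placeholder values; in practice you would json.load the downloaded evaluation_results.json instead.

```python
import json

# Illustrative sample mirroring the documented schema (placeholder values only;
# the real data comes from the downloaded evaluation_results.json)
sample = """
{
  "evaluation_metadata": {"dataset": "RAGBench", "total_samples": 1},
  "aggregate_metrics": {},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "What is AI?",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "What is AI?",
        "context_documents": ["doc1", "doc2"],
        "llm_response": "AI is...",
        "labeling_prompt": "Label the response...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "{...raw JSON...}"
      }
    }
  ]
}
"""

data = json.loads(sample)
first = data["detailed_results"][0]["llm_request"]
print(sorted(first.keys()))  # the nine audit-trail fields
```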

Verification Checklist

Code-Level Verification

# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py

# 2. Run the test script
python test_llm_audit_trail.py

# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================

JSON Structure Verification

The downloaded JSON should contain:

  • evaluation_metadata with timestamp, dataset, method, total_samples
  • aggregate_metrics with main metrics
  • rmse_metrics if available
  • auc_metrics if available
  • detailed_results array with multiple query results
  • Each detailed_result contains:
    • query_id: Integer starting from 1
    • question: The user's question
    • llm_response: The LLM's response
    • retrieved_documents: Array of context documents
    • metrics: Dictionary with metric scores
    • ground_truth_scores: Dictionary with ground truth values
    • llm_request: Dictionary containing:
      • system_prompt: System instruction (non-empty string)
      • query: User question (matches question field)
      • context_documents: Array of documents (matches retrieved_documents)
      • llm_response: Original response (matches llm_response field)
      • labeling_prompt: Generated prompt (non-empty string)
      • model: Model name (e.g., "groq-default")
      • temperature: Should be 0.0
      • max_tokens: Should be 2048
      • full_llm_response: Complete raw response (non-empty string)
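
The field checks above can be automated with a small helper. This is a sketch, assuming each entry of detailed_results follows the schema shown earlier:

```python
REQUIRED_LLM_REQUEST_FIELDS = [
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
]

def missing_audit_fields(detailed_result: dict) -> list[str]:
    """Return the names of required llm_request fields that are absent or empty."""
    llm_request = detailed_result.get("llm_request") or {}
    missing = []
    for field in REQUIRED_LLM_REQUEST_FIELDS:
        value = llm_request.get(field)
        # temperature may legitimately be 0.0, so only treat None/"" as missing
        if value is None or value == "":
            missing.append(field)
    return missing

# Example: a result with an incomplete audit trail
partial = {"llm_request": {"system_prompt": "You are...", "temperature": 0.0}}
print(missing_audit_fields(partial))
```

Running this over every entry of detailed_results gives a per-query report of any gaps in the audit trail.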

Functional Verification

Test 1: Basic Functionality

from advanced_rag_evaluator import AdvancedRAGEvaluator

evaluator = AdvancedRAGEvaluator(llm_client=client, ...)
test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}

# Should return dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")

Test 2: JSON Serialization

import json

# Download JSON and verify it's valid
with open("evaluation_results.json", "r") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")

Test 3: Backwards Compatibility

# evaluate() now returns a (scores, llm_info) tuple,
# so existing call sites only need to unpack two values
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)

scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards compatible tuple unpacking works")

Expected Results

When you download the JSON and inspect it:

  1. LLM Request Field Present: Each query result contains a complete llm_request object
  2. All 9 Fields Present: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
  3. Data Consistency: Values in llm_request match corresponding fields in the query result
  4. JSON Valid: File is valid JSON that can be parsed and inspected
  5. Complete Audit Trail: Full visibility into what was sent to LLM and what it returned
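
The data-consistency expectation (item 3) can be expressed directly. A minimal sketch over one entry of detailed_results, assuming the field names shown above:

```python
def check_consistency(result: dict) -> bool:
    """Verify that llm_request values mirror the root-level fields of a result."""
    req = result["llm_request"]
    return (
        req["query"] == result["question"]
        and req["context_documents"] == result["retrieved_documents"]
        and req["llm_response"] == result["llm_response"]
    )

# Illustrative result entry with matching root and llm_request values
result = {
    "question": "What is AI?",
    "retrieved_documents": ["doc1", "doc2"],
    "llm_response": "AI is...",
    "llm_request": {
        "query": "What is AI?",
        "context_documents": ["doc1", "doc2"],
        "llm_response": "AI is...",
    },
}
print(check_consistency(result))  # True for a consistent audit trail
```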

What Each Field Represents

| Field | Example value | Purpose |
| --- | --- | --- |
| system_prompt | "You are an expert RAG evaluator..." | System instruction given to the LLM for labeling |
| query | "What is artificial intelligence?" | The user's question being evaluated |
| context_documents | Array of document strings | Retrieved context documents provided to the LLM |
| llm_response | "AI is the simulation..." | Original LLM response being evaluated |
| labeling_prompt | Long prompt text | Generated prompt with instructions for labeling |
| model | "groq-default" | Which LLM model was used |
| temperature | 0.0 | Temperature setting (0 = deterministic) |
| max_tokens | 2048 | Token limit used for the LLM call |
| full_llm_response | Complete raw response | Exact response from the LLM before JSON parsing |

Common Issues and Solutions

Issue 1: llm_request field is empty/missing

Cause: LLM client not available or the call failed.
Solution: Ensure the Groq API key is configured and the network is available.

Issue 2: Context documents empty in llm_request

Cause: Documents were not retrieved properly.
Solution: Check that document retrieval is working in the evaluation pipeline.

Issue 3: JSON file not downloading

Cause: Large file size or a Streamlit issue.
Solution: Ensure the browser has sufficient memory and try refreshing the page.

Issue 4: Unicode encoding errors

Cause: Special characters in the LLM response.
Solution: Open the JSON with UTF-8 encoding:

# Windows
notepad.exe evaluation_results.json
# Then: File > Save As > Encoding: UTF-8

# Or use Python
import json
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

Running Tests

Automated Test Suite

# Run the comprehensive test
python test_llm_audit_trail.py

# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED

Manual Testing Steps

  1. Test with Single Query

    • Run evaluation with 1 sample
    • Download JSON
    • Verify llm_request has all fields
  2. Test with Multiple Queries

    • Run evaluation with 5 samples
    • Download JSON
    • Verify each query has complete llm_request
  3. Test Data Consistency

    • Compare llm_request.query with root question field
    • Compare llm_request.context_documents with retrieved_documents
    • Verify all strings are non-empty
  4. Test File Size

    • Check JSON file is reasonable size (typically 50-500KB for 10 queries)
    • Verify file opens in text editor without issues
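
Steps 3 and 4 can be combined into one pass over the downloaded file. A sketch, assuming the download is named evaluation_results.json; the temporary file below only stands in for a real download:

```python
import json
import os
import tempfile

# Stand-in for a downloaded file; point `path` at your real download instead
sample = {"detailed_results": [{"query_id": 1, "llm_request": {"query": "q"}}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    path = f.name

size_kb = os.path.getsize(path) / 1024
with open(path, encoding="utf-8") as f:
    data = json.load(f)

n = len(data["detailed_results"])
print(f"{n} queries, {size_kb:.1f} KB ({size_kb / n:.1f} KB per query)")
os.remove(path)
```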

Success Criteria

✅ All items in the checklist above are verified
✅ Test script runs without errors
✅ JSON downloads successfully
✅ llm_request field present in all results
✅ All 9 required fields populated
✅ JSON is valid and well-formed
✅ File can be opened and inspected
✅ Data is consistent across results

Next Steps After Verification

  1. Review Audit Trail: Inspect the captured LLM interactions
  2. Validate Quality: Check if prompts and responses look correct
  3. Test Reproduction: Use the captured data to reproduce evaluations if needed
  4. Archive Results: Store JSON for compliance/auditing purposes
  5. Iterate: Use insights from audit trail to improve prompts if needed

Support

If you encounter issues:

  1. Check error messages in Streamlit console
  2. Review LLMAUDITTRAIL_CHANGES.md for implementation details
  3. Run test_llm_audit_trail.py for automated diagnostics
  4. Check CODE_CHANGES_REFERENCE.md for code-level details