Verification Guide: LLM Audit Trail Feature
Quick Start: How to Test the Implementation
Step 1: Start the Application
```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```
Step 2: Run an Evaluation
- Select RAGBench dataset
- Choose GPT Labeling or Hybrid evaluation method
- Set a small sample count (1-3 for testing)
- Click "Start Evaluation"
- Wait for evaluation to complete
Step 3: Download Results
- Scroll to the "Download Results" section
- Click the "Download Complete Results (JSON)" button
- Save the file to your computer
Step 4: Inspect the JSON
Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
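If you prefer to spot-check the file programmatically instead of eyeballing it, a small helper like the one below can confirm the top-level keys first (the filename in the usage comment is just an example):

```python
import json

EXPECTED_TOP_LEVEL = ("evaluation_metadata", "aggregate_metrics", "detailed_results")

def check_top_level(data):
    """Return the expected top-level keys that are missing from the results dict."""
    return [key for key in EXPECTED_TOP_LEVEL if key not in data]

# Usage on a downloaded file (path is an example):
# with open("evaluation_results.json", encoding="utf-8") as f:
#     print(check_top_level(json.load(f)) or "top-level structure OK")
```

An empty return value means all three top-level sections are present.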
Verification Checklist
Code-Level Verification
```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py

# 2. Run the test script
python test_llm_audit_trail.py

# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
JSON Structure Verification
The downloaded JSON should contain:
- `evaluation_metadata` with timestamp, dataset, method, total_samples
- `aggregate_metrics` with main metrics
- `rmse_metrics` if available
- `auc_metrics` if available
- `detailed_results` array with multiple query results
- Each detailed result contains:
  - `query_id`: Integer starting from 1
  - `question`: The user's question
  - `llm_response`: The LLM's response
  - `retrieved_documents`: Array of context documents
  - `metrics`: Dictionary with metric scores
  - `ground_truth_scores`: Dictionary with ground truth values
  - `llm_request`: Dictionary containing:
    - `system_prompt`: System instruction (non-empty string)
    - `query`: User question (matches the `question` field)
    - `context_documents`: Array of documents (matches `retrieved_documents`)
    - `llm_response`: Original response (matches the `llm_response` field)
    - `labeling_prompt`: Generated prompt (non-empty string)
    - `model`: Model name (e.g., "groq-default")
    - `temperature`: Should be 0.0
    - `max_tokens`: Should be 2048
    - `full_llm_response`: Complete raw response (non-empty string)
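A sketch of how this checklist could be automated (field names follow the structure above; treat this as a starting point, not part of the shipped test suite):

```python
REQUIRED_FIELDS = (
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
)

def missing_fields(llm_request):
    """Names of required audit-trail fields that are absent or empty.

    temperature may legitimately be 0.0, so only None, "", and []
    count as empty here.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = llm_request.get(field)
        if value is None or value == "" or value == []:
            missing.append(field)
    return missing

def validate_results(data):
    """Map query_id -> list of missing fields for every incomplete result."""
    problems = {}
    for result in data.get("detailed_results", []):
        bad = missing_fields(result.get("llm_request", {}))
        if bad:
            problems[result.get("query_id")] = bad
    return problems
```

An empty dict from `validate_results` means every result carries a complete audit trail.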
Functional Verification
Test 1: Basic Functionality
```python
from advanced_rag_evaluator import AdvancedRAGEvaluator

evaluator = AdvancedRAGEvaluator(llm_client=client, ...)

test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}

# Should return a dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
Test 2: JSON Serialization
```python
import json

# Load the downloaded JSON and verify it is valid
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
Test 3: Backwards Compatibility
```python
# evaluate() accepts the same arguments as before...
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)

# ...but now returns a (scores, llm_info) tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards-compatible tuple unpacking works")
```
Expected Results
When you download the JSON and inspect it:
- **LLM Request Field Present**: Each query result contains a complete `llm_request` object
- **All 9 Fields Present**: All required fields (`system_prompt`, `query`, `context_documents`, `llm_response`, `labeling_prompt`, `model`, `temperature`, `max_tokens`, `full_llm_response`)
- **Data Consistency**: Values in `llm_request` match the corresponding fields in the query result
- **JSON Valid**: The file is valid JSON that can be parsed and inspected
- **Complete Audit Trail**: Full visibility into what was sent to the LLM and what it returned
What Each Field Represents
| Field | Value | Purpose |
|---|---|---|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to the LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to the LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for the LLM call |
| `full_llm_response` | Complete raw response | Exact response from the LLM before JSON parsing |
Common Issues and Solutions
Issue 1: `llm_request` field is empty or missing
- Cause: LLM client not available, or the call failed
- Solution: Ensure the Groq API key is configured and the network is available

Issue 2: Context documents empty in `llm_request`
- Cause: Documents were not retrieved properly
- Solution: Check that document retrieval is working in the evaluation pipeline

Issue 3: JSON file not downloading
- Cause: Large file size or a Streamlit issue
- Solution: Ensure the browser has sufficient memory and try refreshing the page

Issue 4: Unicode encoding errors
- Cause: Special characters in the LLM response
- Solution: Open the JSON with UTF-8 encoding
```bash
# Windows
notepad.exe evaluation_results.json
# Then: File > Save As > Encoding: UTF-8
```

Or use Python:

```python
import json

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
Running Tests
Automated Test Suite
```bash
# Run the comprehensive test
python test_llm_audit_trail.py

# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
Manual Testing Steps
Test with Single Query
- Run evaluation with 1 sample
- Download JSON
- Verify llm_request has all fields
Test with Multiple Queries
- Run evaluation with 5 samples
- Download JSON
- Verify each query has complete llm_request
Test Data Consistency
- Compare `llm_request.query` with the root `question` field
- Compare `llm_request.context_documents` with `retrieved_documents`
- Verify all strings are non-empty
Test File Size
- Check JSON file is reasonable size (typically 50-500KB for 10 queries)
- Verify file opens in text editor without issues
Success Criteria
- ✅ All items in the checklist above are verified
- ✅ Test script runs without errors
- ✅ JSON downloads successfully
- ✅ `llm_request` field present in all results
- ✅ All 9 required fields populated
- ✅ JSON is valid and well-formed
- ✅ File can be opened and inspected
- ✅ Data is consistent across results
Next Steps After Verification
- Review Audit Trail: Inspect the captured LLM interactions
- Validate Quality: Check if prompts and responses look correct
- Test Reproduction: Use the captured data to reproduce evaluations if needed
- Archive Results: Store JSON for compliance/auditing purposes
- Iterate: Use insights from audit trail to improve prompts if needed
Support
If you encounter issues:
- Check error messages in Streamlit console
- Review `LLM_AUDIT_TRAIL_CHANGES.md` for implementation details
- Run test_llm_audit_trail.py for automated diagnostics
- Check CODE_CHANGES_REFERENCE.md for code-level details