# Verification Guide: LLM Audit Trail Feature
## Quick Start: How to Test the Implementation
### Step 1: Start the Application
```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```
### Step 2: Run an Evaluation
1. Select **RAGBench** dataset
2. Choose **GPT Labeling** or **Hybrid** evaluation method
3. Set a small sample count (1-3 for testing)
4. Click "Start Evaluation"
5. Wait for evaluation to complete
### Step 3: Download Results
1. Scroll to the "💾 Download Results" section
2. Click the "📥 Download Complete Results (JSON)" button
3. Save the file to your computer
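If you want to see how such a download is typically produced, it is a standard Streamlit download widget. The sketch below is illustrative only: the `results` dict and file name are placeholders, not the app's actual identifiers.

```python
import json
import streamlit as st

# Placeholder: `results` stands in for the dict the app assembles after evaluation
# (evaluation_metadata, aggregate_metrics, detailed_results, ...).
results = {"evaluation_metadata": {}, "aggregate_metrics": {}, "detailed_results": []}

st.download_button(
    label="Download Complete Results (JSON)",
    data=json.dumps(results, indent=2, ensure_ascii=False),
    file_name="evaluation_results.json",
    mime="application/json",
)
```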
### Step 4: Inspect the JSON
Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": {...},
  "aggregate_metrics": {...},
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
## Verification Checklist
### Code-Level Verification
```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py
# 2. Run the test script
python test_llm_audit_trail.py
# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
### JSON Structure Verification
The downloaded JSON should contain:
- [ ] `evaluation_metadata` with timestamp, dataset, method, total_samples
- [ ] `aggregate_metrics` with main metrics
- [ ] `rmse_metrics` if available
- [ ] `auc_metrics` if available
- [ ] `detailed_results` array with multiple query results
- [ ] Each detailed_result contains:
  - [ ] `query_id`: Integer starting from 1
  - [ ] `question`: The user's question
  - [ ] `llm_response`: The LLM's response
  - [ ] `retrieved_documents`: Array of context documents
  - [ ] `metrics`: Dictionary with metric scores
  - [ ] `ground_truth_scores`: Dictionary with ground truth values
  - [ ] `llm_request`: Dictionary containing:
    - [ ] `system_prompt`: System instruction (non-empty string)
    - [ ] `query`: User question (matches `question` field)
    - [ ] `context_documents`: Array of documents (matches `retrieved_documents`)
    - [ ] `llm_response`: Original response (matches `llm_response` field)
    - [ ] `labeling_prompt`: Generated prompt (non-empty string)
    - [ ] `model`: Model name (e.g., "groq-default")
    - [ ] `temperature`: Should be 0.0
    - [ ] `max_tokens`: Should be 2048
    - [ ] `full_llm_response`: Complete raw response (non-empty string)
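The checklist above can also be run programmatically. Below is a minimal sketch, assuming the file was saved as `evaluation_results.json`; the optional `rmse_metrics` / `auc_metrics` keys are not asserted.

```python
import json

REQUIRED_LLM_REQUEST_FIELDS = [
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens", "full_llm_response",
]

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Required top-level keys (rmse_metrics / auc_metrics are optional)
for key in ("evaluation_metadata", "aggregate_metrics", "detailed_results"):
    assert key in data, f"missing top-level key: {key}"

# Every detailed result must carry a complete llm_request audit trail
for result in data["detailed_results"]:
    llm_request = result.get("llm_request", {})
    missing = [f for f in REQUIRED_LLM_REQUEST_FIELDS if f not in llm_request]
    assert not missing, f"query {result.get('query_id')}: missing {missing}"

print(f"[PASS] {len(data['detailed_results'])} results contain a complete llm_request")
```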
### Functional Verification
**Test 1: Basic Functionality**
```python
from advanced_rag_evaluator import AdvancedRAGEvaluator
# `client` is your configured LLM client; other constructor arguments elided
evaluator = AdvancedRAGEvaluator(llm_client=client, ...)
test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}
# Should return dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
**Test 2: JSON Serialization**
```python
import json

# Load the downloaded JSON and confirm it parses
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
**Test 3: Backwards Compatibility**
```python
# Old code should still work; evaluate() now returns a (scores, llm_info) tuple
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)
# Unpack the tuple returned by the new implementation
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards compatible tuple unpacking works")
```
## Expected Results
When you download the JSON and inspect it:
1. **LLM Request Field Present**: Each query result contains a complete `llm_request` object
2. **All 9 Fields Present**: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
3. **Data Consistency**: Values in `llm_request` match corresponding fields in the query result
4. **JSON Valid**: File is valid JSON that can be parsed and inspected
5. **Complete Audit Trail**: Full visibility into what was sent to LLM and what it returned
## What Each Field Represents
| Field | Value | Purpose |
|-------|-------|---------|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for LLM call |
| `full_llm_response` | Complete raw response | Exact response from LLM before JSON parsing |
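To inspect these fields for a single query, you can print the audit trail straight from the downloaded file (the filename below is assumed):

```python
import json

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Print the audit trail captured for the first evaluated query
audit = data["detailed_results"][0]["llm_request"]
print("model:         ", audit["model"])
print("temperature:   ", audit["temperature"])
print("max_tokens:    ", audit["max_tokens"])
print("query:         ", audit["query"])
print("documents:     ", len(audit["context_documents"]), "context documents")
print("system prompt: ", audit["system_prompt"][:80], "...")
print("raw response:  ", audit["full_llm_response"][:80], "...")
```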
## Common Issues and Solutions
### Issue 1: llm_request field is empty/missing
**Cause**: LLM client not available or failed
**Solution**: Ensure Groq API key is configured and network is available
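As a quick pre-flight check, you can confirm the key is visible to the process. The sketch below assumes the key is stored in a `GROQ_API_KEY` environment variable; adjust if your deployment uses Streamlit secrets or another mechanism.

```python
import os

# Assumption: the evaluator reads the key from the GROQ_API_KEY environment variable.
if not os.environ.get("GROQ_API_KEY"):
    print("[WARN] GROQ_API_KEY is not set; the evaluator cannot call the LLM, "
          "so llm_request fields will be empty or missing.")
else:
    print("[OK] Groq API key found in the environment.")
```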
### Issue 2: Context documents empty in llm_request
**Cause**: Documents not retrieved properly
**Solution**: Check that document retrieval is working in evaluation pipeline
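To confirm whether retrieval is the culprit, scan the downloaded JSON for empty document lists (the filename is assumed):

```python
import json

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Flag queries whose retrieved or forwarded document lists are empty
for result in data["detailed_results"]:
    qid = result["query_id"]
    if not result.get("retrieved_documents"):
        print(f"query {qid}: retrieved_documents is empty")
    elif not result.get("llm_request", {}).get("context_documents"):
        print(f"query {qid}: documents were retrieved but not passed to the LLM")
```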
### Issue 3: JSON file not downloading
**Cause**: Large file size or Streamlit issue
**Solution**: Ensure browser has sufficient memory and try refreshing page
### Issue 4: Unicode encoding errors
**Cause**: Special characters in LLM response
**Solution**: Open JSON with UTF-8 encoding
```bash
# Windows: open and re-save the file with UTF-8 encoding
notepad.exe evaluation_results.json
# Then: File > Save As > Encoding: UTF-8
```
Or read the file with explicit UTF-8 encoding in Python:
```python
import json
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
## Running Tests
### Automated Test Suite
```bash
# Run the comprehensive test
python test_llm_audit_trail.py
# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
### Manual Testing Steps
1. **Test with Single Query**
   - Run evaluation with 1 sample
   - Download JSON
   - Verify `llm_request` has all fields
2. **Test with Multiple Queries**
   - Run evaluation with 5 samples
   - Download JSON
   - Verify each query has a complete `llm_request`
3. **Test Data Consistency** (see the sketch after this list)
   - Compare `llm_request.query` with the root `question` field
   - Compare `llm_request.context_documents` with `retrieved_documents`
   - Verify all strings are non-empty
4. **Test File Size**
   - Check the JSON file is a reasonable size (typically 50-500 KB for 10 queries)
   - Verify the file opens in a text editor without issues
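A minimal sketch combining the consistency and file-size checks above, assuming the file is saved as `evaluation_results.json`:

```python
import json
import os

path = "evaluation_results.json"  # assumed filename

# File size check: roughly 50-500 KB is typical for ~10 queries
size_kb = os.path.getsize(path) / 1024
print(f"File size: {size_kb:.1f} KB")

with open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Consistency check: llm_request values should mirror the root-level fields
for result in data["detailed_results"]:
    req = result["llm_request"]
    qid = result["query_id"]
    assert req["query"] == result["question"], f"query mismatch for query_id {qid}"
    assert req["context_documents"] == result["retrieved_documents"], f"document mismatch for query_id {qid}"
    assert req["llm_response"] == result["llm_response"], f"response mismatch for query_id {qid}"
    for field in ("system_prompt", "labeling_prompt", "full_llm_response"):
        assert req[field], f"empty {field} for query_id {qid}"

print("[PASS] llm_request values are consistent with the root-level fields")
```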
## Success Criteria
- ✅ All items in the checklist above are verified
- ✅ Test script runs without errors
- ✅ JSON downloads successfully
- ✅ `llm_request` field present in all results
- ✅ All 9 required fields populated
- ✅ JSON is valid and well-formed
- ✅ File can be opened and inspected
- ✅ Data is consistent across results
## Next Steps After Verification
1. **Review Audit Trail**: Inspect the captured LLM interactions
2. **Validate Quality**: Check if prompts and responses look correct
3. **Test Reproduction**: Use the captured data to reproduce evaluations if needed (see the sketch after this list)
4. **Archive Results**: Store JSON for compliance/auditing purposes
5. **Iterate**: Use insights from audit trail to improve prompts if needed
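For step 3, the captured fields map naturally onto a chat-completions call. The sketch below assumes the Groq Python SDK, an API key in the environment, and that the stored `labeling_prompt` was sent as the user message; the stored `model` value ("groq-default") is an internal label, so substitute the Groq model ID actually used by the evaluator.

```python
import json
from groq import Groq  # assumption: a Groq chat-completions client is used for labeling

with open("evaluation_results.json", "r", encoding="utf-8") as f:
    audit = json.load(f)["detailed_results"][0]["llm_request"]

client = Groq()  # reads GROQ_API_KEY from the environment

# Re-send the captured prompts with the same sampling settings
completion = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # substitute the model actually used
    messages=[
        {"role": "system", "content": audit["system_prompt"]},
        {"role": "user", "content": audit["labeling_prompt"]},
    ],
    temperature=audit["temperature"],
    max_tokens=audit["max_tokens"],
)

# Compare the fresh response with the stored one
print(completion.choices[0].message.content[:200])
print(audit["full_llm_response"][:200])
```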
## Support
If you encounter issues:
1. Check error messages in Streamlit console
2. Review LLMAUDITTRAIL_CHANGES.md for implementation details
3. Run test_llm_audit_trail.py for automated diagnostics
4. Check CODE_CHANGES_REFERENCE.md for code-level details