# Verification Guide: LLM Audit Trail Feature
## Quick Start: How to Test the Implementation
### Step 1: Start the Application
```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```
### Step 2: Run an Evaluation
1. Select **RAGBench** dataset
2. Choose **GPT Labeling** or **Hybrid** evaluation method
3. Set a small sample count (1-3 for testing)
4. Click "Start Evaluation"
5. Wait for evaluation to complete
### Step 3: Download Results
1. Scroll to "πŸ’Ύ Download Results" section
2. Click "πŸ“₯ Download Complete Results (JSON)" button
3. Save the file to your computer
### Step 4: Inspect the JSON
Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": { ... },
  "aggregate_metrics": { ... },
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
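As a quick sanity check, the top-level keys can be verified programmatically. This is a minimal sketch: the `data` dict below is a stand-in for a downloaded file, and `check_top_level` is a hypothetical helper, not part of the project code.

```python
import json  # used when loading a real downloaded file (see comment below)

REQUIRED_TOP_LEVEL = ("evaluation_metadata", "aggregate_metrics", "detailed_results")

def check_top_level(data: dict) -> list:
    """Return any required top-level keys missing from the results dict."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in data]

# Stand-in for a downloaded file; in practice load it with:
#   with open("evaluation_results.json", encoding="utf-8") as f:
#       data = json.load(f)
data = {
    "evaluation_metadata": {},
    "aggregate_metrics": {},
    "detailed_results": [{"query_id": 1}],
}
missing = check_top_level(data)
print("Missing keys:", missing or "none")
```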
## Verification Checklist
### Code-Level Verification
```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py
# 2. Run the test script
python test_llm_audit_trail.py
# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
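The same syntax check can be scripted from Python via the standard-library `py_compile` module, which is what the shell commands above invoke. This sketch demos on a throwaway file; locally you would point it at `advanced_rag_evaluator.py` and `evaluation_pipeline.py`.

```python
import pathlib
import py_compile
import tempfile

def compiles_cleanly(path: str) -> bool:
    """Byte-compile a file and report whether it has syntax errors."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError as err:
        print(f"Syntax error in {path}:\n{err.msg}")
        return False

# Demo with a throwaway file; replace with the real module paths locally.
demo = pathlib.Path(tempfile.mkdtemp()) / "demo.py"
demo.write_text("x = 1\n")
print(compiles_cleanly(str(demo)))
```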
### JSON Structure Verification
The downloaded JSON should contain:
- [ ] `evaluation_metadata` with timestamp, dataset, method, total_samples
- [ ] `aggregate_metrics` with main metrics
- [ ] `rmse_metrics` if available
- [ ] `auc_metrics` if available
- [ ] `detailed_results` array with multiple query results
- [ ] Each `detailed_result` contains:
  - [ ] `query_id`: Integer starting from 1
  - [ ] `question`: The user's question
  - [ ] `llm_response`: The LLM's response
  - [ ] `retrieved_documents`: Array of context documents
  - [ ] `metrics`: Dictionary with metric scores
  - [ ] `ground_truth_scores`: Dictionary with ground truth values
  - [ ] `llm_request`: Dictionary containing:
    - [ ] `system_prompt`: System instruction (non-empty string)
    - [ ] `query`: User question (matches `question` field)
    - [ ] `context_documents`: Array of documents (matches `retrieved_documents`)
    - [ ] `llm_response`: Original response (matches `llm_response` field)
    - [ ] `labeling_prompt`: Generated prompt (non-empty string)
    - [ ] `model`: Model name (e.g., "groq-default")
    - [ ] `temperature`: Should be 0.0
    - [ ] `max_tokens`: Should be 2048
    - [ ] `full_llm_response`: Complete raw response (non-empty string)
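The `llm_request` portion of this checklist can be folded into a small helper. The field names come from the schema above; `validate_llm_request` and the `sample` dict are illustrative, not part of the project code.

```python
REQUIRED_LLM_REQUEST_FIELDS = (
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
)

def validate_llm_request(llm_request: dict) -> list:
    """Return a list of problems found in a single llm_request dict."""
    problems = [f"missing: {name}" for name in REQUIRED_LLM_REQUEST_FIELDS
                if name not in llm_request]
    # Strings the checklist requires to be non-empty
    for name in ("system_prompt", "labeling_prompt", "full_llm_response"):
        if name in llm_request and not llm_request[name]:
            problems.append(f"empty: {name}")
    return problems

sample = {
    "system_prompt": "You are an expert RAG evaluator...",
    "query": "What is AI?",
    "context_documents": ["doc1"],
    "llm_response": "AI is...",
    "labeling_prompt": "Rate the response...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
    "full_llm_response": "{...}",
}
print(validate_llm_request(sample))
```

An empty list means the entry passes every `llm_request` item in the checklist.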
### Functional Verification
**Test 1: Basic Functionality**
```python
from advanced_rag_evaluator import AdvancedRAGEvaluator
evaluator = AdvancedRAGEvaluator(llm_client=client, ...)  # client: your configured LLM client
test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}
# Should return dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
**Test 2: JSON Serialization**
```python
import json
# Download JSON and verify it's valid
with open("evaluation_results.json", "r") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
**Test 3: Backwards Compatibility**
```python
# The call signature is unchanged, so existing callers still work
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)
# evaluate() now returns a (scores, llm_info) tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards compatible tuple unpacking works")
```
## Expected Results
When you download the JSON and inspect it:
1. **LLM Request Field Present**: Each query result contains a complete `llm_request` object
2. **All 9 Fields Present**: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
3. **Data Consistency**: Values in `llm_request` match corresponding fields in the query result
4. **JSON Valid**: File is valid JSON that can be parsed and inspected
5. **Complete Audit Trail**: Full visibility into what was sent to LLM and what it returned
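The data-consistency expectation (item 3) can be checked mechanically. This is a sketch: `check_consistency` is a hypothetical helper, and the `result` dict below is a trimmed example of one `detailed_results` entry.

```python
def check_consistency(result: dict) -> list:
    """Cross-check llm_request values against the enclosing query result."""
    req = result.get("llm_request", {})
    issues = []
    if req.get("query") != result.get("question"):
        issues.append("query does not match question")
    if req.get("context_documents") != result.get("retrieved_documents"):
        issues.append("context_documents does not match retrieved_documents")
    if req.get("llm_response") != result.get("llm_response"):
        issues.append("llm_response fields differ")
    return issues

# Trimmed example of one detailed_results entry
result = {
    "question": "What is AI?",
    "retrieved_documents": ["doc1", "doc2"],
    "llm_response": "AI is...",
    "llm_request": {
        "query": "What is AI?",
        "context_documents": ["doc1", "doc2"],
        "llm_response": "AI is...",
    },
}
print(check_consistency(result))
```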
## What Each Field Represents
| Field | Value | Purpose |
|-------|-------|---------|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for LLM call |
| `full_llm_response` | Complete raw response | Exact response from LLM before JSON parsing |
## Common Issues and Solutions
### Issue 1: `llm_request` field is empty or missing
- **Cause**: The LLM client is unavailable or the call failed
- **Solution**: Ensure the Groq API key is configured and the network is reachable

### Issue 2: Context documents empty in `llm_request`
- **Cause**: Documents were not retrieved properly
- **Solution**: Check that document retrieval is working in the evaluation pipeline

### Issue 3: JSON file not downloading
- **Cause**: Large file size or a Streamlit issue
- **Solution**: Ensure the browser has sufficient memory and try refreshing the page

### Issue 4: Unicode encoding errors
- **Cause**: Special characters in the LLM response
- **Solution**: Open the JSON with UTF-8 encoding
```bash
# Windows: open in Notepad, then File > Save As > Encoding: UTF-8
notepad.exe evaluation_results.json
```

```python
# Or read the file with explicit UTF-8 encoding in Python
import json
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
## Running Tests
### Automated Test Suite
```bash
# Run the comprehensive test
python test_llm_audit_trail.py
# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
### Manual Testing Steps
1. **Test with Single Query**
   - Run an evaluation with 1 sample
   - Download the JSON
   - Verify `llm_request` has all fields
2. **Test with Multiple Queries**
   - Run an evaluation with 5 samples
   - Download the JSON
   - Verify each query has a complete `llm_request`
3. **Test Data Consistency**
   - Compare `llm_request.query` with the root `question` field
   - Compare `llm_request.context_documents` with `retrieved_documents`
   - Verify all strings are non-empty
4. **Test File Size**
   - Check that the JSON file is a reasonable size (typically 50-500 KB for 10 queries)
   - Verify the file opens in a text editor without issues
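The file-size and validity checks above can be combined into one step. This sketch writes a throwaway file for the demo; locally you would call `describe_results_file` on your downloaded `evaluation_results.json`.

```python
import json
import os
import tempfile

def describe_results_file(path: str) -> dict:
    """Report the size and query count of a downloaded results file."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises ValueError if the JSON is malformed
    return {"size_kb": round(size_kb, 1),
            "queries": len(data.get("detailed_results", []))}

# Demo on a temporary file; point at your downloaded results file locally.
tmp = os.path.join(tempfile.mkdtemp(), "evaluation_results.json")
with open(tmp, "w", encoding="utf-8") as f:
    json.dump({"detailed_results": [{"query_id": 1}]}, f)
print(describe_results_file(tmp))
```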
## Success Criteria
- βœ… All items in checklist above are verified
- βœ… Test script runs without errors
- βœ… JSON downloads successfully
- βœ… `llm_request` field present in all results
- βœ… All 9 required fields populated
- βœ… JSON is valid and well-formed
- βœ… File can be opened and inspected
- βœ… Data is consistent across results
## Next Steps After Verification
1. **Review Audit Trail**: Inspect the captured LLM interactions
2. **Validate Quality**: Check if prompts and responses look correct
3. **Test Reproduction**: Use the captured data to reproduce evaluations if needed
4. **Archive Results**: Store JSON for compliance/auditing purposes
5. **Iterate**: Use insights from audit trail to improve prompts if needed
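For the reproduction step, the captured `llm_request` contains everything needed to rebuild the original call inputs. This is a sketch only: `build_replay_messages` is a hypothetical helper, and the commented-out client call is an assumption to be adjusted to your actual LLM client.

```python
def build_replay_messages(llm_request: dict) -> list:
    """Rebuild a chat-style message list from a captured llm_request entry."""
    return [
        {"role": "system", "content": llm_request["system_prompt"]},
        {"role": "user", "content": llm_request["labeling_prompt"]},
    ]

captured = {  # trimmed example of a captured llm_request
    "system_prompt": "You are an expert RAG evaluator...",
    "labeling_prompt": "Rate the following response...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
}
messages = build_replay_messages(captured)
# A replay call might then look like (hypothetical client, adjust to yours):
#   client.chat.completions.create(model=captured["model"], messages=messages,
#       temperature=captured["temperature"], max_tokens=captured["max_tokens"])
print(messages[0]["role"], messages[1]["role"])
```

Because `temperature` was 0.0, a replay should produce a response close to the captured `full_llm_response`, which makes drift easy to spot.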
## Support
If you encounter issues:
1. Check error messages in the Streamlit console
2. Review `LLMAUDITTRAIL_CHANGES.md` for implementation details
3. Run `test_llm_audit_trail.py` for automated diagnostics
4. Check `CODE_CHANGES_REFERENCE.md` for code-level details