# Verification Guide: LLM Audit Trail Feature
## Quick Start: How to Test the Implementation
### Step 1: Start the Application
```bash
cd "d:\CapStoneProject\RAG Capstone Project"
streamlit run streamlit_app.py
```
### Step 2: Run an Evaluation
1. Select **RAGBench** dataset
2. Choose **GPT Labeling** or **Hybrid** evaluation method
3. Set a small sample count (1-3 for testing)
4. Click "Start Evaluation"
5. Wait for evaluation to complete
### Step 3: Download Results
1. Scroll to "πŸ’Ύ Download Results" section
2. Click "πŸ“₯ Download Complete Results (JSON)" button
3. Save the file to your computer
### Step 4: Inspect the JSON
Open the downloaded JSON file with a text editor and verify:
```json
{
  "evaluation_metadata": { ... },
  "aggregate_metrics": { ... },
  "detailed_results": [
    {
      "query_id": 1,
      "question": "...",
      "llm_request": {
        "system_prompt": "You are an expert RAG evaluator...",
        "query": "...",
        "context_documents": ["doc1", "doc2", ...],
        "llm_response": "...",
        "labeling_prompt": "...",
        "model": "groq-default",
        "temperature": 0.0,
        "max_tokens": 2048,
        "full_llm_response": "..."
      }
    }
  ]
}
```
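As a quick sanity check, the top-level keys can be verified programmatically. This is a minimal sketch: the `data` dict below is a stand-in for a downloaded file, and `check_top_level` is a hypothetical helper, not part of the project code.

```python
import json  # used when loading a real downloaded file (see comment below)

REQUIRED_TOP_LEVEL = ("evaluation_metadata", "aggregate_metrics", "detailed_results")

def check_top_level(data: dict) -> list:
    """Return any required top-level keys missing from the results dict."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in data]

# Stand-in for a downloaded file; in practice load it with:
#   with open("evaluation_results.json", encoding="utf-8") as f:
#       data = json.load(f)
data = {
    "evaluation_metadata": {},
    "aggregate_metrics": {},
    "detailed_results": [{"query_id": 1}],
}
missing = check_top_level(data)
print("Missing keys:", missing or "none")
```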
## Verification Checklist
### Code-Level Verification
```bash
# 1. Check for syntax errors
python -m py_compile advanced_rag_evaluator.py
python -m py_compile evaluation_pipeline.py
# 2. Run the test script
python test_llm_audit_trail.py
# Expected output should show:
# ======================================================================
# RESULT: ALL TESTS PASSED
# ======================================================================
```
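The same syntax check can be scripted from Python via the standard-library `py_compile` module, which is what the shell commands above invoke. This sketch demos on a throwaway file; locally you would point it at `advanced_rag_evaluator.py` and `evaluation_pipeline.py`.

```python
import pathlib
import py_compile
import tempfile

def compiles_cleanly(path: str) -> bool:
    """Byte-compile a file and report whether it has syntax errors."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError as err:
        print(f"Syntax error in {path}:\n{err.msg}")
        return False

# Demo with a throwaway file; replace with the real module paths locally.
demo = pathlib.Path(tempfile.mkdtemp()) / "demo.py"
demo.write_text("x = 1\n")
print(compiles_cleanly(str(demo)))
```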
### JSON Structure Verification
The downloaded JSON should contain:
- [ ] `evaluation_metadata` with timestamp, dataset, method, total_samples
- [ ] `aggregate_metrics` with main metrics
- [ ] `rmse_metrics` if available
- [ ] `auc_metrics` if available
- [ ] `detailed_results` array with multiple query results
- [ ] Each `detailed_result` contains:
  - [ ] `query_id`: Integer starting from 1
  - [ ] `question`: The user's question
  - [ ] `llm_response`: The LLM's response
  - [ ] `retrieved_documents`: Array of context documents
  - [ ] `metrics`: Dictionary with metric scores
  - [ ] `ground_truth_scores`: Dictionary with ground truth values
  - [ ] `llm_request`: Dictionary containing:
    - [ ] `system_prompt`: System instruction (non-empty string)
    - [ ] `query`: User question (matches `question` field)
    - [ ] `context_documents`: Array of documents (matches `retrieved_documents`)
    - [ ] `llm_response`: Original response (matches `llm_response` field)
    - [ ] `labeling_prompt`: Generated prompt (non-empty string)
    - [ ] `model`: Model name (e.g., "groq-default")
    - [ ] `temperature`: Should be 0.0
    - [ ] `max_tokens`: Should be 2048
    - [ ] `full_llm_response`: Complete raw response (non-empty string)
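The `llm_request` portion of this checklist can be folded into a small helper. The field names come from the schema above; `validate_llm_request` and the `sample` dict are illustrative, not part of the project code.

```python
REQUIRED_LLM_REQUEST_FIELDS = (
    "system_prompt", "query", "context_documents", "llm_response",
    "labeling_prompt", "model", "temperature", "max_tokens",
    "full_llm_response",
)

def validate_llm_request(llm_request: dict) -> list:
    """Return a list of problems found in a single llm_request dict."""
    problems = [f"missing: {name}" for name in REQUIRED_LLM_REQUEST_FIELDS
                if name not in llm_request]
    # Strings the checklist requires to be non-empty
    for name in ("system_prompt", "labeling_prompt", "full_llm_response"):
        if name in llm_request and not llm_request[name]:
            problems.append(f"empty: {name}")
    return problems

sample = {
    "system_prompt": "You are an expert RAG evaluator...",
    "query": "What is AI?",
    "context_documents": ["doc1"],
    "llm_response": "AI is...",
    "labeling_prompt": "Rate the response...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
    "full_llm_response": "{...}",
}
print(validate_llm_request(sample))
```

An empty list means the entry passes every `llm_request` item in the checklist.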
### Functional Verification
**Test 1: Basic Functionality**
```python
from advanced_rag_evaluator import AdvancedRAGEvaluator
evaluator = AdvancedRAGEvaluator(llm_client=client, ...)  # client: your configured LLM client
test_case = {
    "query": "What is AI?",
    "response": "AI is artificial intelligence...",
    "retrieved_documents": ["AI doc 1", "AI doc 2"]
}
# Should return dict with "detailed_results" containing "llm_request"
result = evaluator.evaluate_batch([test_case])
assert "detailed_results" in result
assert "llm_request" in result["detailed_results"][0]
assert "system_prompt" in result["detailed_results"][0]["llm_request"]
print("[PASS] LLM audit trail is stored correctly")
```
**Test 2: JSON Serialization**
```python
import json
# Download JSON and verify it's valid
with open("evaluation_results.json", "r") as f:
    data = json.load(f)

# Verify structure
assert "detailed_results" in data
for result in data["detailed_results"]:
    assert "llm_request" in result
    assert result["llm_request"].get("system_prompt")
    assert result["llm_request"].get("query")
    assert result["llm_request"].get("context_documents")
    assert result["llm_request"].get("full_llm_response")
print("[PASS] JSON structure is valid and complete")
```
**Test 3: Backwards Compatibility**
```python
# The call signature is unchanged, so existing callers still work
result = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=["doc1", "doc2"]
)
# evaluate() now returns a (scores, llm_info) tuple
scores, llm_info = result
assert scores is not None
assert isinstance(llm_info, dict)
print("[PASS] Backwards compatible tuple unpacking works")
```
## Expected Results
When you download the JSON and inspect it:
1. **LLM Request Field Present**: Each query result contains a complete `llm_request` object
2. **All 9 Fields Present**: All required fields (system_prompt, query, context_documents, llm_response, labeling_prompt, model, temperature, max_tokens, full_llm_response)
3. **Data Consistency**: Values in `llm_request` match corresponding fields in the query result
4. **JSON Valid**: File is valid JSON that can be parsed and inspected
5. **Complete Audit Trail**: Full visibility into what was sent to LLM and what it returned
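The data-consistency expectation (item 3) can be checked mechanically. This is a sketch: `check_consistency` is a hypothetical helper, and the `result` dict below is a trimmed example of one `detailed_results` entry.

```python
def check_consistency(result: dict) -> list:
    """Cross-check llm_request values against the enclosing query result."""
    req = result.get("llm_request", {})
    issues = []
    if req.get("query") != result.get("question"):
        issues.append("query does not match question")
    if req.get("context_documents") != result.get("retrieved_documents"):
        issues.append("context_documents does not match retrieved_documents")
    if req.get("llm_response") != result.get("llm_response"):
        issues.append("llm_response fields differ")
    return issues

# Trimmed example of one detailed_results entry
result = {
    "question": "What is AI?",
    "retrieved_documents": ["doc1", "doc2"],
    "llm_response": "AI is...",
    "llm_request": {
        "query": "What is AI?",
        "context_documents": ["doc1", "doc2"],
        "llm_response": "AI is...",
    },
}
print(check_consistency(result))
```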
## What Each Field Represents
| Field | Value | Purpose |
|-------|-------|---------|
| `system_prompt` | "You are an expert RAG evaluator..." | System instruction given to LLM for labeling |
| `query` | "What is artificial intelligence?" | The user's question being evaluated |
| `context_documents` | Array of document strings | Retrieved context documents provided to LLM |
| `llm_response` | "AI is the simulation..." | Original LLM response being evaluated |
| `labeling_prompt` | Long prompt text | Generated prompt with instructions for labeling |
| `model` | "groq-default" | Which LLM model was used |
| `temperature` | 0.0 | Temperature setting (0 = deterministic) |
| `max_tokens` | 2048 | Token limit used for LLM call |
| `full_llm_response` | Complete raw response | Exact response from LLM before JSON parsing |
## Common Issues and Solutions
### Issue 1: `llm_request` field is empty or missing
- **Cause**: The LLM client is unavailable or the call failed
- **Solution**: Ensure the Groq API key is configured and the network is reachable

### Issue 2: Context documents empty in `llm_request`
- **Cause**: Documents were not retrieved properly
- **Solution**: Check that document retrieval is working in the evaluation pipeline

### Issue 3: JSON file not downloading
- **Cause**: Large file size or a Streamlit issue
- **Solution**: Ensure the browser has sufficient memory and try refreshing the page

### Issue 4: Unicode encoding errors
- **Cause**: Special characters in the LLM response
- **Solution**: Open the JSON with UTF-8 encoding
```bash
# Windows: open in Notepad, then File > Save As > Encoding: UTF-8
notepad.exe evaluation_results.json
```

```python
# Or read the file with explicit UTF-8 encoding in Python
import json
with open("evaluation_results.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```
## Running Tests
### Automated Test Suite
```bash
# Run the comprehensive test
python test_llm_audit_trail.py
# Should see:
# [STEP 1] _get_gpt_labels() returns dict with audit trail
# [STEP 2] evaluate() unpacks tuple and returns (scores, llm_info)
# [STEP 3] evaluate_batch() stores llm_request in detailed_results
# [STEP 4] JSON download includes complete audit trail
# [STEP 5] Validation checks
# RESULT: ALL TESTS PASSED
```
### Manual Testing Steps
1. **Test with Single Query**
   - Run an evaluation with 1 sample
   - Download the JSON
   - Verify `llm_request` has all fields
2. **Test with Multiple Queries**
   - Run an evaluation with 5 samples
   - Download the JSON
   - Verify each query has a complete `llm_request`
3. **Test Data Consistency**
   - Compare `llm_request.query` with the root `question` field
   - Compare `llm_request.context_documents` with `retrieved_documents`
   - Verify all strings are non-empty
4. **Test File Size**
   - Check that the JSON file is a reasonable size (typically 50-500 KB for 10 queries)
   - Verify the file opens in a text editor without issues
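The file-size and validity checks above can be combined into one step. This sketch writes a throwaway file for the demo; locally you would call `describe_results_file` on your downloaded `evaluation_results.json`.

```python
import json
import os
import tempfile

def describe_results_file(path: str) -> dict:
    """Report the size and query count of a downloaded results file."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises ValueError if the JSON is malformed
    return {"size_kb": round(size_kb, 1),
            "queries": len(data.get("detailed_results", []))}

# Demo on a temporary file; point at your downloaded results file locally.
tmp = os.path.join(tempfile.mkdtemp(), "evaluation_results.json")
with open(tmp, "w", encoding="utf-8") as f:
    json.dump({"detailed_results": [{"query_id": 1}]}, f)
print(describe_results_file(tmp))
```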
## Success Criteria
- βœ… All items in checklist above are verified
- βœ… Test script runs without errors
- βœ… JSON downloads successfully
- βœ… `llm_request` field present in all results
- βœ… All 9 required fields populated
- βœ… JSON is valid and well-formed
- βœ… File can be opened and inspected
- βœ… Data is consistent across results
## Next Steps After Verification
1. **Review Audit Trail**: Inspect the captured LLM interactions
2. **Validate Quality**: Check if prompts and responses look correct
3. **Test Reproduction**: Use the captured data to reproduce evaluations if needed
4. **Archive Results**: Store JSON for compliance/auditing purposes
5. **Iterate**: Use insights from audit trail to improve prompts if needed
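For the reproduction step, the captured `llm_request` contains everything needed to rebuild the original call inputs. This is a sketch only: `build_replay_messages` is a hypothetical helper, and the commented-out client call is an assumption to be adjusted to your actual LLM client.

```python
def build_replay_messages(llm_request: dict) -> list:
    """Rebuild a chat-style message list from a captured llm_request entry."""
    return [
        {"role": "system", "content": llm_request["system_prompt"]},
        {"role": "user", "content": llm_request["labeling_prompt"]},
    ]

captured = {  # trimmed example of a captured llm_request
    "system_prompt": "You are an expert RAG evaluator...",
    "labeling_prompt": "Rate the following response...",
    "model": "groq-default",
    "temperature": 0.0,
    "max_tokens": 2048,
}
messages = build_replay_messages(captured)
# A replay call might then look like (hypothetical client, adjust to yours):
#   client.chat.completions.create(model=captured["model"], messages=messages,
#       temperature=captured["temperature"], max_tokens=captured["max_tokens"])
print(messages[0]["role"], messages[1]["role"])
```

Because `temperature` was 0.0, a replay should produce a response close to the captured `full_llm_response`, which makes drift easy to spot.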
## Support
If you encounter issues:
1. Check error messages in the Streamlit console
2. Review `LLMAUDITTRAIL_CHANGES.md` for implementation details
3. Run `test_llm_audit_trail.py` for automated diagnostics
4. Check `CODE_CHANGES_REFERENCE.md` for code-level details