# RGB RAG Evaluation Pipeline - Setup & Execution Guide

## Project Structure

```
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py             # Groq API client with rate limiting
│   ├── data_loader.py            # RGB dataset loader (4 tasks)
│   ├── evaluator.py              # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py                # Figure 3 prompt templates
│   ├── pipeline.py               # Main evaluation pipeline
│   └── config.py                 # Configuration constants
├── data/
│   ├── en_refine.json            # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json               # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json              # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                        # Python 3.13.1 virtual environment
├── requirements.txt              # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                  # Template for GROQ_API_KEY
├── quick_test.py                 # Quick connectivity test
├── run_evaluation.py             # Main evaluation script
├── download_datasets.py          # Dataset download script (already executed)
├── test_refactored_pipeline.py   # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md  # Review of all changes made
├── DETAILED_CHANGES.md           # Detailed code changes per file
└── README.md                     # Original project documentation
```

## Setup Instructions

### Step 1: Get a Groq API Key

1. Visit https://console.groq.com/keys
2. Sign up (a free tier is available)
3. Create an API key for your account
4. Copy the key to your clipboard

### Step 2: Create the .env File

Create the file `d:\CapStoneProject\RGB\.env`:

```
GROQ_API_KEY=your_api_key_here
```

Replace `your_api_key_here` with your actual Groq API key.

### Step 3: Verify Setup

Run the quick test to verify that everything is connected:

```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```

Expected output:

```
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
```
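The API-key check that `quick_test.py` performs can be approximated with just the standard library. This is an illustrative sketch, not the actual script: the helper names are invented here, and it assumes `.env` holds simple `KEY=value` lines (a dotenv library would do the same job):

```python
import os
from pathlib import Path


def load_dotenv_minimal(path: str = ".env") -> None:
    """Load simple KEY=value lines from a .env file into os.environ.

    Blank lines, '#' comments, and lines without '=' are ignored.
    Variables already present in the environment are not overwritten.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


def check_api_key() -> str:
    """Fail fast with a clear message if GROQ_API_KEY is missing."""
    load_dotenv_minimal()
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise SystemExit("GROQ_API_KEY not found - create the .env file first")
    print("OK: GROQ_API_KEY found")
    return key
```

Failing fast here is deliberate: it surfaces a missing key before any evaluation work starts, which is what the quick test's first check is for.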
## Evaluation Execution

### Option 1: Quick Test (5 samples per task)

For testing the complete pipeline:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```

This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take 30-60 seconds per model
- Output results to `results/evaluation_results.json`

### Option 2: Medium Test (20 samples per task)

For validating results:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```

This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics

### Option 3: Full Evaluation (300 samples per task)

For final results matching the paper:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```

This will:
- Use all available samples
- Test at all 5 noise ratios with full data
- Take 15-30 minutes per model
- Match the methodology of the paper's Tables 1-7

## Understanding the Results

### Output File: results/evaluation_results.json

Example structure:

```json
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
```

### Metric Definitions

**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100

**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models

**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models

**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) × 100
- Expected: 90%+ for good models

## Performance Notes

### Groq Free Tier Limits

- API rate limit: ~30 requests per minute
- Response time: 100-500 ms per request
- Models available: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768

### Estimated Evaluation Times

| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |

### Model Recommendations

1. **llama-3.3-70b-versatile** (best quality, slowest)
   - Best for: Most accurate evaluations
   - Speed: ~300 ms per request
2. **llama-3.1-8b-instant** (fast, good quality)
   - Best for: Quick testing and validation
   - Speed: ~100 ms per request
3. **mixtral-8x7b-32768** (balanced)
   - Best for: Production evaluations
   - Speed: ~200 ms per request
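The four metric formulas above translate directly into code. The following is a minimal sketch of the arithmetic only; the function names are illustrative, and the project's `src/evaluator.py` may organize this differently:

```python
def accuracy(correct: int, total: int) -> float:
    """(correct_answers / total_samples) x 100"""
    return 100.0 * correct / total if total else 0.0


def rejection_rate(rejected: int, total: int) -> float:
    """(rejected_responses / total_samples) x 100"""
    return 100.0 * rejected / total if total else 0.0


def error_detection_rate(detected: int, total: int) -> float:
    """(detected_errors / total_samples) x 100"""
    return 100.0 * detected / total if total else 0.0


def error_correction_rate(corrected: int, detected: int) -> float:
    """(corrected_errors / detected_errors) x 100.

    Note: per the definition above, the denominator is the number of
    *detected* errors, not the total sample count.
    """
    return 100.0 * corrected / detected if detected else 0.0
```

For example, `accuracy(285, 300)` yields the 95.00 figure shown in the sample output; guarding against a zero denominator keeps very small `--max-samples` runs from crashing.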
## Troubleshooting

### Issue: GROQ_API_KEY not found

**Solution:**
- Create a `.env` file containing `GROQ_API_KEY=your_key`
- Or set the environment variable: `$env:GROQ_API_KEY='your_key'`

### Issue: Rate limit errors

**Solution:**
- The LLM client has built-in rate limiting (0.5 s between requests)
- If errors still occur, reduce the `--max-samples` parameter
- Use a smaller model (llama-3.1-8b-instant) for testing

### Issue: Out of memory

**Solution:**
- Run with a smaller batch: `--max-samples 50`
- Use a faster model: llama-3.1-8b-instant

### Issue: Type errors

**Solution:**
- Verify that Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`

## Validation Against Paper

After running the evaluation, compare your results with Tables 1-7 of the RGB benchmark paper:

**Table 1: Noise Robustness** (page 8)
- Expected: Accuracy decreases as the noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with the paper's

**Table 2: Negative Rejection** (page 8)
- Expected: Rejection rate of ~95%+ for good models
- Validate: Check the rejection_rate metric

**Table 3: Information Integration** (page 9)
- Expected: Accuracy of 85%+ for good models
- Validate: Compare integrated-answer accuracy

**Table 4: Counterfactual Robustness** (page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in the results

## Next Steps

1. ✅ Setup complete: Create `.env` with your Groq API key
2. ✅ Run the quick test: `python run_evaluation.py --max-samples 5`
3. ✅ Validate that the results fall in the expected ranges
4. ✅ Run the full evaluation: `python run_evaluation.py --max-samples 300`
5. ✅ Compare the final results with the paper's Tables 1-7
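The built-in rate limiting mentioned under Troubleshooting (0.5 s between requests, comfortably under ~30 requests/minute) can be sketched as a small client-side limiter. This is an assumption-laden illustration, not the actual `llm_client.py` implementation:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive API requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval   # seconds between requests
        self._last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough so that at least min_interval seconds
        separate this request from the previous one."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `limiter.wait()` immediately before each chat-completion request caps throughput at roughly `1 / min_interval` requests per second; if 429 errors still appear, raising `min_interval` is safer than retrying in a tight loop.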
## Project Compliance Summary

✅ **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)

✅ **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)

✅ **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: `"Document:\n{documents}\n\nQuestion: {question}"`

✅ **All Required Metrics:**
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)

✅ **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated

---

**Status: Ready for evaluation with the Groq free API tier**

For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes made
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation
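The Figure 3 input format quoted in the compliance summary can be rendered by a small helper. A hypothetical sketch: the function name and the newline-joining of multiple documents are assumptions here, not necessarily how `src/prompts.py` does it:

```python
def build_user_prompt(documents: list[str], question: str) -> str:
    """Render retrieved documents and the question in the Figure 3
    input format: a "Document:" header, the documents, a blank line,
    then "Question: ...".
    """
    docs_text = "\n".join(documents)   # assumed join strategy
    return f"Document:\n{docs_text}\n\nQuestion: {question}"
```

The task-specific system instruction from Figure 3 would travel separately as the system message of the chat request, leaving this string as the user message.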