# RGB RAG Evaluation Pipeline - Setup & Execution Guide

## Project Structure

```
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py              # Groq API client with rate limiting
│   ├── data_loader.py             # RGB dataset loader (4 tasks)
│   ├── evaluator.py               # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py                 # Figure 3 prompt templates
│   ├── pipeline.py                # Main evaluation pipeline
│   └── config.py                  # Configuration constants
├── data/
│   ├── en_refine.json             # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json                # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json               # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                         # Python 3.13.1 virtual environment
├── requirements.txt               # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                   # Template for GROQ_API_KEY
├── quick_test.py                  # Quick connectivity test
├── run_evaluation.py              # Main evaluation script
├── download_datasets.py           # Dataset download script (already executed)
├── test_refactored_pipeline.py   # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md  # Review of all changes made
├── DETAILED_CHANGES.md           # Detailed code changes per file
└── README.md                      # Original project documentation
```
## Setup Instructions

### Step 1: Get a Groq API Key

1. Visit https://console.groq.com/keys
2. Sign up (a free tier is available)
3. Create an API key for your account
4. Copy the key to your clipboard

### Step 2: Create the .env File

Create the file `d:\CapStoneProject\RGB\.env`:

```
GROQ_API_KEY=your_api_key_here
```

Replace `your_api_key_here` with your actual Groq API key.
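The pipeline reads the key from the environment. The actual loading logic lives in `config.py`/`llm_client.py`; the snippet below is only a minimal sketch of how `KEY=value` lines can be parsed into `os.environ` (the function name `load_env` is illustrative, not from the codebase):

```python
import os

def load_env(path=".env"):
    """Parse KEY=value lines from a .env file into os.environ.
    Blank lines and lines starting with '#' are skipped."""
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # setdefault: a key already exported in the shell wins
                os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
print("GROQ_API_KEY set:", "GROQ_API_KEY" in os.environ)
```

Keys already exported in your shell take precedence, so `$env:GROQ_API_KEY` (see Troubleshooting) still works.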
### Step 3: Verify Setup

Run the quick test to verify that everything is connected:

```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```

Expected output:

```
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
```
## Evaluation Execution

### Option 1: Quick Test (5 samples per task)

For testing the complete pipeline end to end:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```

This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take ~10-20 seconds per model
- Write results to `results/evaluation_results.json`

### Option 2: Medium Test (20 samples per task)

For validating results:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```

This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics

### Option 3: Full Evaluation (300 samples per task)

For final results matching the paper:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```

This will:
- Use all available samples
- Test at 5 noise ratios with the full data
- Take 10-30 minutes per model
- Match the methodology behind the paper's Tables 1-7
## Understanding the Results

### Output File: results/evaluation_results.json

Example structure (abridged; the `//` comment marks omitted entries):

```json
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
```
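Because the `results` list is flat, it loads naturally into a table for inspection. A sketch using pandas (already listed in `requirements.txt`); the helper name `summarize` is illustrative:

```python
import json

import pandas as pd

def summarize(path="results/evaluation_results.json"):
    """Pivot the flat results list into a model x task table of accuracies.
    Tasks without an accuracy field (e.g. negative_rejection) show as NaN."""
    with open(path) as fh:
        data = json.load(fh)
    df = pd.DataFrame(data["results"])
    return df.pivot_table(index="model_name", columns="task_type",
                          values="accuracy")
```

Printing the returned frame gives one row per model and one column per task, which makes the noise-ratio accuracy curve easy to read off.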
### Metric Definitions

**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100

**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models

**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models

**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) × 100
- Expected: 90%+ for good models
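Plugging the example counts from the results file into these formulas (the helper `pct` is hypothetical): note that error correction divides by *detected* errors, not total samples, so the corresponding value in the example JSON, which appears to divide by total samples, comes out differently.

```python
def pct(count, denominator):
    """Percentage rounded to two decimals, matching the report format."""
    return round(100.0 * count / denominator, 2)

# Example counts from the results file shown above:
accuracy              = pct(285, 300)  # correct / total_samples     -> 95.0
rejection_rate        = pct(285, 300)  # rejected / total_samples    -> 95.0
error_detection_rate  = pct(85, 100)   # detected / total_samples    -> 85.0
error_correction_rate = pct(80, 85)    # corrected / detected errors -> 94.12
```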
## Performance Notes

### Groq Free Tier Limits

- API rate limit: ~30 requests per minute
- Response time: 100-500 ms per request
- Models available: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768

### Estimated Evaluation Times

| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |

### Model Recommendations

1. **llama-3.3-70b-versatile** (best quality, slowest)
   - Best for: Most accurate evaluations
   - Speed: ~300 ms per request
2. **llama-3.1-8b-instant** (fast, good quality)
   - Best for: Quick testing and validation
   - Speed: ~100 ms per request
3. **mixtral-8x7b-32768** (balanced)
   - Best for: Production evaluations
   - Speed: ~200 ms per request
## Troubleshooting

### Issue: GROQ_API_KEY not found

**Solution:**
- Create a `.env` file containing `GROQ_API_KEY=your_key`
- Or set the environment variable: `$env:GROQ_API_KEY='your_key'`

### Issue: Rate limit errors

**Solution:**
- The LLM client has built-in rate limiting (0.5 s between requests)
- If errors still occur, reduce the `--max-samples` parameter
- Use a smaller model (llama-3.1-8b-instant) for testing
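The built-in spacing can be pictured as a minimum-interval limiter. A minimal sketch, assuming the simplest design; the actual implementation in `llm_client.py` may differ:

```python
import time

class RateLimiter:
    """Sleep just long enough to keep a minimum interval between requests
    (0.5 s mirrors the spacing described above)."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `wait()` before each API request caps throughput at roughly `1 / min_interval` requests per second; raising `min_interval` is the knob to turn if rate-limit errors persist.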
### Issue: Out of memory

**Solution:**
- Run with a smaller batch: `--max-samples 50`
- Use a faster model: llama-3.1-8b-instant

### Issue: Type errors

**Solution:**
- Verify that Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`
## Validation Against Paper

After running the evaluation, compare your results with Tables 1-7 of the RGB benchmark paper:

**Table 1: Noise Robustness** (page 8)
- Expected: Accuracy decreases as the noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with the paper's

**Table 2: Negative Rejection** (page 8)
- Expected: Rejection rate ~95%+ for good models
- Validate: Check the rejection_rate metric

**Table 3: Information Integration** (page 9)
- Expected: Accuracy 85%+ for good models
- Validate: Compare integrated-answer accuracy

**Table 4: Counterfactual Robustness** (page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in the results
## Next Steps

1. Setup complete: create `.env` with your Groq API key
2. Run the quick test: `python run_evaluation.py --max-samples 5`
3. Validate that results fall in the expected ranges
4. Run the full evaluation: `python run_evaluation.py --max-samples 300`
5. Compare final results with the paper's Tables 1-7
## Project Compliance Summary

✅ **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)

✅ **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)

✅ **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: "Document:\n{documents}\n\nQuestion: {question}"
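The quoted input format can be assembled as follows. A sketch only: the real template lives in `prompts.py`, the helper name `build_prompt` is illustrative, and joining multiple documents with single newlines is an assumption:

```python
def build_prompt(documents, question):
    """Assemble the Figure 3 user message: documents block, blank line, question."""
    doc_block = "\n".join(documents)  # assumption: newline-separated documents
    return f"Document:\n{doc_block}\n\nQuestion: {question}"
```

The task-specific system instruction is sent separately as the system message.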
✅ **All Required Metrics:**
- Accuracy (noise robustness, information integration)
- Rejection rate (negative rejection)
- Error detection & correction rates (counterfactual robustness)

✅ **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated
---

**Status: Ready for evaluation with the Groq free API tier**

For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation