RGBMetrics / SETUP_AND_EXECUTION_GUIDE.md
RGB Evaluation
feat: Add separate grid layout for 4 RAG abilities in Streamlit UI
af25c62
# RGB RAG Evaluation Pipeline - Setup & Execution Guide
## Project Structure
```
d:\CapStoneProject\RGB\
β”œβ”€β”€ src/
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ llm_client.py # Groq API client with rate limiting
β”‚ β”œβ”€β”€ data_loader.py # RGB dataset loader (4 tasks)
β”‚ β”œβ”€β”€ evaluator.py # Metric calculations (accuracy, rejection, error rates)
β”‚ β”œβ”€β”€ prompts.py # Figure 3 prompt templates
β”‚ β”œβ”€β”€ pipeline.py # Main evaluation pipeline
β”‚ └── config.py # Configuration constants
β”œβ”€β”€ data/
β”‚ β”œβ”€β”€ en_refine.json # Noise robustness + negative rejection data (7.2 MB, 300 samples)
β”‚ β”œβ”€β”€ en_int.json # Information integration data (5.1 MB, 100 samples)
β”‚ └── en_fact.json # Counterfactual robustness data (241 KB, 100 samples)
β”œβ”€β”€ .venv/ # Python 3.13.1 virtual environment
β”œβ”€β”€ requirements.txt # Dependencies (groq, tqdm, requests, pandas, pymupdf)
β”œβ”€β”€ .env.example # Template for GROQ_API_KEY
β”œβ”€β”€ quick_test.py # Quick connectivity test
β”œβ”€β”€ run_evaluation.py # Main evaluation script
β”œβ”€β”€ download_datasets.py # Dataset download script (already executed)
β”œβ”€β”€ test_refactored_pipeline.py # Comprehensive test suite
β”œβ”€β”€ COMPLIANCE_REVIEW_SUMMARY.md # Review of all changes made
β”œβ”€β”€ DETAILED_CHANGES.md # Detailed code changes per file
└── README.md # Original project documentation
```
## Setup Instructions
### Step 1: Get Groq API Key
1. Visit https://console.groq.com/keys
2. Sign up (free tier available)
3. Create API key for your account
4. Copy the key to your clipboard
### Step 2: Create .env File
Create file: `d:\CapStoneProject\RGB\.env`
```
GROQ_API_KEY=your_api_key_here
```
Replace `your_api_key_here` with your actual Groq API key.
### Step 3: Verify Setup
Run quick test to verify everything is connected:
```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```
Expected output:
```
βœ“ GROQ_API_KEY found
βœ“ Groq client initialized
βœ“ Testing LLM connectivity...
βœ“ Models available: 3
βœ“ Ready for evaluation!
```
## Evaluation Execution
### Option 1: Quick Test (5 samples per task)
For testing the complete pipeline:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```
This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take ~10-20 seconds per model
- Output results to `results/evaluation_results.json`
### Option 2: Medium Test (20 samples per task)
For validating results:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```
This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics
### Option 3: Full Evaluation (300 samples per task)
For final results matching paper:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```
This will:
- Use all available samples
- Test at 5 noise ratios with full data
- Take 10-30 minutes per model
- Match paper's Table 1-7 methodology
## Understanding the Results
### Output File: results/evaluation_results.json
Example structure:
```json
{
"timestamp": "2024-01-15 10:30:45",
"models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
"results": [
{
"task_type": "noise_robustness_0%",
"model_name": "llama-3.3-70b-versatile",
"total_samples": 300,
"correct": 285,
"incorrect": 15,
"accuracy": 95.00,
"rejected": 0,
"rejection_rate": 0.00
},
{
"task_type": "noise_robustness_20%",
"model_name": "llama-3.3-70b-versatile",
"total_samples": 300,
"correct": 265,
"incorrect": 35,
"accuracy": 88.33,
"rejected": 0,
"rejection_rate": 0.00
},
// ... more results for each noise ratio and task
{
"task_type": "negative_rejection",
"model_name": "llama-3.3-70b-versatile",
"total_samples": 300,
"rejected": 285,
"rejection_rate": 95.00
},
{
"task_type": "counterfactual_robustness",
"model_name": "llama-3.3-70b-versatile",
"total_samples": 100,
"error_detection_count": 85,
"error_detection_rate": 85.00,
"error_correction_count": 80,
"error_correction_rate": 80.00
}
]
}
```
### Metric Definitions
**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) Γ— 100
**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) Γ— 100
- Expected: ~95%+ for good models
**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) Γ— 100
- Expected: 70%+ for good models
**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) Γ— 100
- Expected: 90%+ for good models
## Performance Notes
### Groq Free Tier Limits
- API rate limit: ~30 requests per minute
- Response time: 100-500ms per request
- Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b
### Estimated Evaluation Times
| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |
### Model Recommendations
1. **llama-3.3-70b-versatile** (Best quality, slowest)
- Best for: Most accurate evaluations
- Speed: ~300ms per request
2. **llama-3.1-8b-instant** (Fast, good quality)
- Best for: Quick testing and validation
- Speed: ~100ms per request
3. **mixtral-8x7b-32768** (Balanced)
- Best for: Production evaluations
- Speed: ~200ms per request
## Troubleshooting
### Issue: GROQ_API_KEY not found
**Solution:**
- Create `.env` file with `GROQ_API_KEY=your_key`
- Or set environment variable: `$env:GROQ_API_KEY='your_key'`
### Issue: Rate limit errors
**Solution:**
- LLM client has built-in rate limiting (0.5s between requests)
- If still occurring, reduce `--max-samples` parameter
- Use smaller models (llama-3.1-8b-instant) for testing
### Issue: Out of memory
**Solution:**
- Run with smaller batch: `--max-samples 50`
- Use faster model: llama-3.1-8b-instant
### Issue: Type errors
**Solution:**
- Verify Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`
## Validation Against Paper
After running evaluation, compare results with RGB benchmark paper Table 1-7:
**Table 1: Noise Robustness** (Page 8)
- Expected: Accuracy decreases as noise ratio increases (0% β†’ 80%)
- Validate: Compare your accuracy curve with paper's
**Table 2: Negative Rejection** (Page 8)
- Expected: Rejection rate ~95%+ for good models
- Validate: Check rejection_rate metric
**Table 3: Information Integration** (Page 9)
- Expected: Accuracy 85%+ for good models
- Validate: Compare integrated answer accuracy
**Table 4: Counterfactual Robustness** (Page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in results
## Next Steps
1. βœ… Setup complete: Create `.env` with Groq API key
2. βœ… Run quick test: `python run_evaluation.py --max-samples 5`
3. βœ… Validate results match expected ranges
4. βœ… Run full evaluation: `python run_evaluation.py --max-samples 300`
5. βœ… Compare final results with paper's Table 1-7
## Project Compliance Summary
βœ… **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)
βœ… **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)
βœ… **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: "Document:\n{documents}\n\nQuestion: {question}"
βœ… **All Required Metrics:**
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)
βœ… **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated
---
**Status: Ready for evaluation with Groq free API tier**
For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation