# RGB RAG Evaluation Pipeline - Setup & Execution Guide
## Project Structure
```
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py              # Groq API client with rate limiting
│   ├── data_loader.py             # RGB dataset loader (4 tasks)
│   ├── evaluator.py               # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py                 # Figure 3 prompt templates
│   ├── pipeline.py                # Main evaluation pipeline
│   └── config.py                  # Configuration constants
├── data/
│   ├── en_refine.json             # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json                # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json               # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                         # Python 3.13.1 virtual environment
├── requirements.txt               # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                   # Template for GROQ_API_KEY
├── quick_test.py                  # Quick connectivity test
├── run_evaluation.py              # Main evaluation script
├── download_datasets.py           # Dataset download script (already executed)
├── test_refactored_pipeline.py    # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md   # Review of all changes made
├── DETAILED_CHANGES.md            # Detailed code changes per file
└── README.md                      # Original project documentation
```
## Setup Instructions
### Step 1: Get Groq API Key
1. Visit https://console.groq.com/keys
2. Sign up (free tier available)
3. Create API key for your account
4. Copy the key to your clipboard
### Step 2: Create .env File
Create file: `d:\CapStoneProject\RGB\.env`
```
GROQ_API_KEY=your_api_key_here
```
Replace `your_api_key_here` with your actual Groq API key.
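Since python-dotenv is not listed in `requirements.txt`, the project presumably reads the key itself; `config.py` or `llm_client.py` may do this differently, but a minimal hand-rolled `.env` loader could look like this (the `load_env` helper is illustrative, not the project's actual code):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Load KEY=value pairs from a .env file into os.environ.

    Blank lines and '#' comments are skipped; variables already set in
    the environment are not overwritten.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("GROQ_API_KEY")
```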
### Step 3: Verify Setup
Run quick test to verify everything is connected:
```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```
Expected output:
```
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
```
## Evaluation Execution
### Option 1: Quick Test (5 samples per task)
For testing the complete pipeline:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```
This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take ~30-60 seconds per model
- Output results to `results/evaluation_results.json`
### Option 2: Medium Test (20 samples per task)
For validating results:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```
This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics
### Option 3: Full Evaluation (300 samples per task)
For final results matching paper:
```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```
This will:
- Use all available samples
- Test at 5 noise ratios with full data
- Take 15-30 minutes per model
- Match paper's Table 1-7 methodology
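The runtime floor follows directly from the client's built-in request spacing (0.5 s between requests, see Troubleshooting). As a back-of-envelope check for the noise-robustness task alone (exact per-task request counts depend on the pipeline's internals):

```python
# Back-of-envelope runtime floor from the 0.5 s request spacing.
RATE_LIMIT_DELAY_S = 0.5    # minimum spacing between requests
SAMPLES_PER_TASK = 300
NOISE_RATIOS = 5            # 0%, 20%, 40%, 60%, 80%

# Noise robustness issues roughly one request per sample per noise ratio.
noise_requests = SAMPLES_PER_TASK * NOISE_RATIOS
minutes = noise_requests * RATE_LIMIT_DELAY_S / 60
print(f"{noise_requests} requests -> at least {minutes:.1f} minutes")
```

That 12.5-minute floor for 1,500 requests is consistent with the 15-30 minute per-model estimate once model latency is added.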
## Understanding the Results
### Output File: results/evaluation_results.json
Example structure:
```json
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
```
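Given that structure, a small helper can group result entries by model for side-by-side comparison (a sketch against the example above; `summarize_results` is illustrative and not part of the pipeline):

```python
import json

def summarize_results(path: str) -> dict:
    """Group result entries by model name.

    Each entry becomes a (task_type, headline_metric) pair, where the
    headline metric is accuracy when present, otherwise rejection_rate
    (counterfactual entries report neither and yield None).
    """
    with open(path) as f:
        data = json.load(f)
    by_model: dict = {}
    for entry in data["results"]:
        metric = entry.get("accuracy", entry.get("rejection_rate"))
        by_model.setdefault(entry["model_name"], []).append(
            (entry["task_type"], metric)
        )
    return by_model
```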
### Metric Definitions
**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100

**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models

**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models

**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) × 100
- Expected: 90%+ for good models
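The four formulas above translate directly into small helpers (a sketch; `evaluator.py`'s actual implementation may differ — note in particular that the example JSON appears to compute correction rate over total samples, while the definition above divides by detected errors; the sketch follows the definitions as written):

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy (%): correct answers over all samples."""
    return 100.0 * correct / total if total else 0.0

def rejection_rate(rejected: int, total: int) -> float:
    """Rejection rate (%): appropriate rejections over all samples."""
    return 100.0 * rejected / total if total else 0.0

def error_detection_rate(detected: int, total: int) -> float:
    """Error detection rate (%): factual errors flagged over all samples."""
    return 100.0 * detected / total if total else 0.0

def error_correction_rate(corrected: int, detected: int) -> float:
    """Error correction rate (%): corrections over *detected* errors."""
    return 100.0 * corrected / detected if detected else 0.0
```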
## Performance Notes
### Groq Free Tier Limits
- API rate limit: ~30 requests per minute
- Response time: 100-500ms per request
- Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b
### Estimated Evaluation Times
| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |
### Model Recommendations
1. **llama-3.3-70b-versatile** (Best quality, slowest)
- Best for: Most accurate evaluations
- Speed: ~300ms per request
2. **llama-3.1-8b-instant** (Fast, good quality)
- Best for: Quick testing and validation
- Speed: ~100ms per request
3. **mixtral-8x7b-32768** (Balanced)
- Best for: Production evaluations
- Speed: ~200ms per request
## Troubleshooting
### Issue: GROQ_API_KEY not found
**Solution:**
- Create `.env` file with `GROQ_API_KEY=your_key`
- Or set environment variable: `$env:GROQ_API_KEY='your_key'`
### Issue: Rate limit errors
**Solution:**
- LLM client has built-in rate limiting (0.5s between requests)
- If still occurring, reduce `--max-samples` parameter
- Use smaller models (llama-3.1-8b-instant) for testing
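The built-in 0.5 s spacing mentioned above amounts to a simple minimum-interval limiter; `llm_client.py` may implement it differently, but the idea can be sketched as:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive calls.

    Illustrative sketch of the 0.5 s request spacing described above,
    not the project's actual llm_client.py code.
    """

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last_call = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        """Sleep just long enough to honor the minimum interval."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Call `limiter.wait()` immediately before each API request; the first call returns instantly, and subsequent calls block only if they arrive too soon.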
### Issue: Out of memory
**Solution:**
- Run with smaller batch: `--max-samples 50`
- Use faster model: llama-3.1-8b-instant
### Issue: Type errors
**Solution:**
- Verify Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`
## Validation Against Paper
After running evaluation, compare results with RGB benchmark paper Table 1-7:
**Table 1: Noise Robustness** (Page 8)
- Expected: Accuracy decreases as noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with paper's
**Table 2: Negative Rejection** (Page 8)
- Expected: Rejection rate ~95%+ for good models
- Validate: Check rejection_rate metric
**Table 3: Information Integration** (Page 9)
- Expected: Accuracy 85%+ for good models
- Validate: Compare integrated answer accuracy
**Table 4: Counterfactual Robustness** (Page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in results
## Next Steps
1. ✅ Setup complete: Create `.env` with Groq API key
2. ✅ Run quick test: `python run_evaluation.py --max-samples 5`
3. ✅ Validate results match expected ranges
4. ✅ Run full evaluation: `python run_evaluation.py --max-samples 300`
5. ✅ Compare final results with paper's Table 1-7
## Project Compliance Summary
✅ **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)

✅ **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)

✅ **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: "Document:\n{documents}\n\nQuestion: {question}"

✅ **All Required Metrics:**
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)

✅ **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated
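The Figure 3 input format noted in the list above can be assembled with a one-line template (a sketch; `prompts.py` defines the actual templates used by the pipeline):

```python
def build_prompt(documents: list[str], question: str) -> str:
    """Join retrieved documents and append the question, following the
    'Document: ... / Question: ...' input format described above."""
    docs = "\n".join(documents)
    return f"Document:\n{docs}\n\nQuestion: {question}"
```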
---
**Status: Ready for evaluation with Groq free API tier**
For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation