# RGB RAG Evaluation Pipeline - Setup & Execution Guide

## Project Structure

```
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py           # Groq API client with rate limiting
│   ├── data_loader.py          # RGB dataset loader (4 tasks)
│   ├── evaluator.py            # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py              # Figure 3 prompt templates
│   ├── pipeline.py             # Main evaluation pipeline
│   └── config.py               # Configuration constants
├── data/
│   ├── en_refine.json          # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json             # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json            # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                      # Python 3.13.1 virtual environment
├── requirements.txt            # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                # Template for GROQ_API_KEY
├── quick_test.py               # Quick connectivity test
├── run_evaluation.py           # Main evaluation script
├── download_datasets.py        # Dataset download script (already executed)
├── test_refactored_pipeline.py # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md # Review of all changes made
├── DETAILED_CHANGES.md         # Detailed code changes per file
└── README.md                   # Original project documentation
```

## Setup Instructions

### Step 1: Get Groq API Key
1. Visit https://console.groq.com/keys
2. Sign up (free tier available)
3. Create API key for your account
4. Copy the key to your clipboard

### Step 2: Create .env File
Create file: `d:\CapStoneProject\RGB\.env`

```
GROQ_API_KEY=your_api_key_here
```

Replace `your_api_key_here` with your actual Groq API key.

### Step 3: Verify Setup
Run quick test to verify everything is connected:

```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```

Expected output:
```
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
```

## Evaluation Execution

### Option 1: Quick Test (5 samples per task)
For testing the complete pipeline:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```

This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take ~30-60 seconds per model (see the timing table below)
- Output results to `results/evaluation_results.json`

### Option 2: Medium Test (20 samples per task)
For validating results:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```

This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics

### Option 3: Full Evaluation (300 samples per task)
For final results matching paper:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```

This will:
- Use all available samples
- Test at 5 noise ratios with full data
- Take 10-30 minutes per model
- Match paper's Table 1-7 methodology

## Understanding the Results

### Output File: results/evaluation_results.json

Example structure:
```json
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
```

### Metric Definitions

**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100

**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models

**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models

**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) × 100
- Expected: 90%+ for good models
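The four definitions translate directly into code. This is a sketch of the formulas exactly as written above (note the different denominator for error correction), not the repo's `evaluator.py` implementation:

```python
def accuracy(correct, total):
    return round(correct / total * 100, 2)

def rejection_rate(rejected, total):
    return round(rejected / total * 100, 2)

def error_detection_rate(detected, total):
    return round(detected / total * 100, 2)

def error_correction_rate(corrected, detected):
    # Per the definition above, corrections are measured against
    # detected errors, not total samples.
    return round(corrected / detected * 100, 2)
```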

## Performance Notes

### Groq Free Tier Limits
- API rate limit: ~30 requests per minute
- Response time: 100-500ms per request
- Models available: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768

### Estimated Evaluation Times
| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |
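These estimates follow from the client's 0.5 s spacing between requests (see Troubleshooting) plus per-request latency. A rough back-of-the-envelope check, with the latency figure an assumed typical value rather than a measured one:

```python
def estimate_minutes(total_requests, spacing_s=0.5, latency_s=0.3):
    """Rough lower-bound wall time: spacing plus latency per request."""
    return total_requests * (spacing_s + latency_s) / 60
```

For the full run (~1500 requests) this comes out around 20 minutes, consistent with the 15-30 minute range in the table.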

### Model Recommendations
1. **llama-3.3-70b-versatile** (Best quality, slowest)
   - Best for: Most accurate evaluations
   - Speed: ~300ms per request
   
2. **llama-3.1-8b-instant** (Fast, good quality)
   - Best for: Quick testing and validation
   - Speed: ~100ms per request
   
3. **mixtral-8x7b-32768** (Balanced)
   - Best for: Production evaluations
   - Speed: ~200ms per request

## Troubleshooting

### Issue: GROQ_API_KEY not found
**Solution:** 
- Create `.env` file with `GROQ_API_KEY=your_key`
- Or set the environment variable in PowerShell: `$env:GROQ_API_KEY='your_key'`

### Issue: Rate limit errors
**Solution:**
- LLM client has built-in rate limiting (0.5s between requests)
- If still occurring, reduce `--max-samples` parameter
- Use smaller models (llama-3.1-8b-instant) for testing
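The built-in 0.5 s spacing mentioned above can be sketched as follows. The real implementation lives in `src/llm_client.py` and may differ in detail:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive requests (sketch)."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval remains.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```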

### Issue: Out of memory
**Solution:**
- Run with smaller batch: `--max-samples 50`
- Use faster model: llama-3.1-8b-instant

### Issue: Type errors
**Solution:**
- Verify Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`

## Validation Against Paper

After running evaluation, compare results with RGB benchmark paper Table 1-7:

**Table 1: Noise Robustness** (Page 8)
- Expected: Accuracy decreases as noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with the paper's

**Table 2: Negative Rejection** (Page 8)
- Expected: Rejection rate ~95%+ for good models
- Validate: Check rejection_rate metric

**Table 3: Information Integration** (Page 9)
- Expected: Accuracy 85%+ for good models
- Validate: Compare integrated answer accuracy

**Table 4: Counterfactual Robustness** (Page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in results

## Next Steps

1. ✅ Setup complete: Create `.env` with Groq API key
2. ✅ Run quick test: `python run_evaluation.py --max-samples 5`
3. ✅ Validate results match expected ranges
4. ✅ Run full evaluation: `python run_evaluation.py --max-samples 300`
5. ✅ Compare final results with paper's Table 1-7

## Project Compliance Summary

✅ **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)
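For intuition, a noise ratio simply fixes how many of the retrieved documents are distractors: at 40% noise with 5 documents, 2 are noisy. The sketch below is a hypothetical illustration (`mix_documents` is not a repo function):

```python
import random

def mix_documents(relevant, noisy, noise_ratio, k=5, seed=0):
    """Build a k-document context with the given fraction of noisy docs."""
    rng = random.Random(seed)
    n_noise = round(k * noise_ratio)
    docs = rng.sample(relevant, k - n_noise) + rng.sample(noisy, n_noise)
    rng.shuffle(docs)  # interleave relevant and noisy documents
    return docs
```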

✅ **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)

✅ **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: "Document:\n{documents}\n\nQuestion: {question}"
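The input format above can be assembled like this; a sketch only, since the actual templates live in `src/prompts.py`:

```python
def build_user_prompt(documents, question):
    """Fill the Figure 3 input template from a list of document strings."""
    doc_block = "\n".join(documents)
    return f"Document:\n{doc_block}\n\nQuestion: {question}"
```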

✅ **All Required Metrics:**
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)

✅ **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated

---

**Status: Ready for evaluation with Groq free API tier**

For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation