# RGB RAG Evaluation Pipeline - Setup & Execution Guide

## Project Structure

```
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py             # Groq API client with rate limiting
│   ├── data_loader.py            # RGB dataset loader (4 tasks)
│   ├── evaluator.py              # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py                # Figure 3 prompt templates
│   ├── pipeline.py               # Main evaluation pipeline
│   └── config.py                 # Configuration constants
├── data/
│   ├── en_refine.json            # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json               # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json              # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                        # Python 3.13.1 virtual environment
├── requirements.txt              # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                  # Template for GROQ_API_KEY
├── quick_test.py                 # Quick connectivity test
├── run_evaluation.py             # Main evaluation script
├── download_datasets.py          # Dataset download script (already executed)
├── test_refactored_pipeline.py   # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md  # Review of all changes made
├── DETAILED_CHANGES.md           # Detailed code changes per file
└── README.md                     # Original project documentation
```

## Setup Instructions

### Step 1: Get a Groq API Key

1. Visit https://console.groq.com/keys
2. Sign up (a free tier is available)
3. Create an API key for your account
4. Copy the key to your clipboard

### Step 2: Create the .env File

Create the file `d:\CapStoneProject\RGB\.env`:

```
GROQ_API_KEY=your_api_key_here
```

Replace `your_api_key_here` with your actual Groq API key.

### Step 3: Verify Setup

Run the quick test to verify that everything is connected:

```bash
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
```

Expected output:

```
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
```
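The API-key check that `quick_test.py` performs can be approximated with just the standard library. This is an illustrative sketch, not the actual script: the helper names are invented here, and it assumes `.env` holds simple `KEY=value` lines (a dotenv library would do the same job):

```python
import os
from pathlib import Path


def load_dotenv_minimal(path: str = ".env") -> None:
    """Load simple KEY=value lines from a .env file into os.environ.

    Blank lines, '#' comments, and lines without '=' are ignored.
    Variables already present in the environment are not overwritten.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())


def check_api_key() -> str:
    """Fail fast with a clear message if GROQ_API_KEY is missing."""
    load_dotenv_minimal()
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise SystemExit("GROQ_API_KEY not found - create the .env file first")
    print("OK: GROQ_API_KEY found")
    return key
```

Failing fast here is deliberate: it surfaces a missing key before any evaluation work starts, which is what the quick test's first check is for.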
## Evaluation Execution

### Option 1: Quick Test (5 samples per task)

For testing the complete pipeline:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
```

This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take 30-60 seconds per model
- Output results to `results/evaluation_results.json`

### Option 2: Medium Test (20 samples per task)

For validating results:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
```

This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics

### Option 3: Full Evaluation (300 samples per task)

For final results matching the paper:

```bash
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
```

This will:
- Use all available samples
- Test at all 5 noise ratios with full data
- Take 15-30 minutes per model
- Match the methodology of the paper's Tables 1-7

## Understanding the Results

### Output File: results/evaluation_results.json

Example structure:

```json
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
```

### Metric Definitions

**Accuracy (%)**: Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100

**Rejection Rate (%)**: Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models

**Error Detection Rate (%)**: Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models

**Error Correction Rate (%)**: Percentage of detected errors corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / detected_errors) × 100
- Expected: 90%+ for good models

## Performance Notes

### Groq Free Tier Limits

- API rate limit: ~30 requests per minute
- Response time: 100-500 ms per request
- Models available: llama-3.3-70b-versatile, llama-3.1-8b-instant, mixtral-8x7b-32768

### Estimated Evaluation Times

| Task | Samples | Time per Model |
|------|---------|----------------|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |

### Model Recommendations

1. **llama-3.3-70b-versatile** (best quality, slowest)
   - Best for: Most accurate evaluations
   - Speed: ~300 ms per request
2. **llama-3.1-8b-instant** (fast, good quality)
   - Best for: Quick testing and validation
   - Speed: ~100 ms per request
3. **mixtral-8x7b-32768** (balanced)
   - Best for: Production evaluations
   - Speed: ~200 ms per request
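The four metric formulas above translate directly into code. The following is a minimal sketch of the arithmetic only; the function names are illustrative, and the project's `src/evaluator.py` may organize this differently:

```python
def accuracy(correct: int, total: int) -> float:
    """(correct_answers / total_samples) x 100"""
    return 100.0 * correct / total if total else 0.0


def rejection_rate(rejected: int, total: int) -> float:
    """(rejected_responses / total_samples) x 100"""
    return 100.0 * rejected / total if total else 0.0


def error_detection_rate(detected: int, total: int) -> float:
    """(detected_errors / total_samples) x 100"""
    return 100.0 * detected / total if total else 0.0


def error_correction_rate(corrected: int, detected: int) -> float:
    """(corrected_errors / detected_errors) x 100.

    Note: per the definition above, the denominator is the number of
    *detected* errors, not the total sample count.
    """
    return 100.0 * corrected / detected if detected else 0.0
```

For example, `accuracy(285, 300)` yields the 95.00 figure shown in the sample output; guarding against a zero denominator keeps very small `--max-samples` runs from crashing.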
## Troubleshooting

### Issue: GROQ_API_KEY not found

**Solution:**
- Create a `.env` file containing `GROQ_API_KEY=your_key`
- Or set the environment variable: `$env:GROQ_API_KEY='your_key'`

### Issue: Rate limit errors

**Solution:**
- The LLM client has built-in rate limiting (0.5 s between requests)
- If errors still occur, reduce the `--max-samples` parameter
- Use a smaller model (llama-3.1-8b-instant) for testing

### Issue: Out of memory

**Solution:**
- Run with a smaller batch: `--max-samples 50`
- Use a faster model: llama-3.1-8b-instant

### Issue: Type errors

**Solution:**
- Verify that Python 3.10+ is being used
- Reinstall dependencies: `pip install -r requirements.txt`

## Validation Against Paper

After running the evaluation, compare your results with Tables 1-7 of the RGB benchmark paper:

**Table 1: Noise Robustness** (page 8)
- Expected: Accuracy decreases as the noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with the paper's

**Table 2: Negative Rejection** (page 8)
- Expected: Rejection rate of ~95%+ for good models
- Validate: Check the rejection_rate metric

**Table 3: Information Integration** (page 9)
- Expected: Accuracy of 85%+ for good models
- Validate: Compare integrated-answer accuracy

**Table 4: Counterfactual Robustness** (page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in the results

## Next Steps

1. ✅ Setup complete: Create `.env` with your Groq API key
2. ✅ Run the quick test: `python run_evaluation.py --max-samples 5`
3. ✅ Validate that the results fall in the expected ranges
4. ✅ Run the full evaluation: `python run_evaluation.py --max-samples 300`
5. ✅ Compare the final results with the paper's Tables 1-7
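The built-in rate limiting mentioned under Troubleshooting (0.5 s between requests, comfortably under ~30 requests/minute) can be sketched as a small client-side limiter. This is an assumption-laden illustration, not the actual `llm_client.py` implementation:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive API requests."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval   # seconds between requests
        self._last_request = 0.0

    def wait(self) -> None:
        """Sleep just long enough so that at least min_interval seconds
        separate this request from the previous one."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `limiter.wait()` immediately before each chat-completion request caps throughput at roughly `1 / min_interval` requests per second; if 429 errors still appear, raising `min_interval` is safer than retrying in a tight loop.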
## Project Compliance Summary

✅ **4 RAG Abilities Implemented:**
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)

✅ **3+ LLM Models Supported:**
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)

✅ **Exact Figure 3 Prompt Format:**
- System instruction specifying task behavior
- Input format: `"Document:\n{documents}\n\nQuestion: {question}"`

✅ **All Required Metrics:**
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)

✅ **Type-Safe & Well-Tested:**
- No type errors or warnings
- Comprehensive test suite included
- All components validated

---

**Status: Ready for evaluation with the Groq free API tier**

For questions or issues, refer to:
- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes made
- `DETAILED_CHANGES.md` - Specific code modifications
- `README.md` - Original project documentation
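The Figure 3 input format quoted in the compliance summary can be rendered by a small helper. A hypothetical sketch: the function name and the newline-joining of multiple documents are assumptions here, not necessarily how `src/prompts.py` does it:

```python
def build_user_prompt(documents: list[str], question: str) -> str:
    """Render retrieved documents and the question in the Figure 3
    input format: a "Document:" header, the documents, a blank line,
    then "Question: ...".
    """
    docs_text = "\n".join(documents)   # assumed join strategy
    return f"Document:\n{docs_text}\n\nQuestion: {question}"
```

The task-specific system instruction from Figure 3 would travel separately as the system message of the chat request, leaving this string as the user message.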