RGB RAG Evaluation Pipeline - Setup & Execution Guide

Project Structure

d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py            # Groq API client with rate limiting
│   ├── data_loader.py           # RGB dataset loader (4 tasks)
│   ├── evaluator.py             # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py               # Figure 3 prompt templates
│   ├── pipeline.py              # Main evaluation pipeline
│   └── config.py                # Configuration constants
├── data/
│   ├── en_refine.json           # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json              # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json             # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                       # Python 3.13.1 virtual environment
├── requirements.txt             # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                 # Template for GROQ_API_KEY
├── quick_test.py                # Quick connectivity test
├── run_evaluation.py            # Main evaluation script
├── download_datasets.py         # Dataset download script (already executed)
├── test_refactored_pipeline.py  # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md # Review of all changes made
├── DETAILED_CHANGES.md          # Detailed code changes per file
└── README.md                    # Original project documentation

Setup Instructions

Step 1: Get Groq API Key

  1. Visit https://console.groq.com/keys
  2. Sign up (free tier available)
  3. Create an API key for your account
  4. Copy the key to your clipboard

Step 2: Create .env File

Create file: d:\CapStoneProject\RGB\.env

GROQ_API_KEY=your_api_key_here

Replace your_api_key_here with your actual Groq API key.
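If you would rather not rely on a separate dotenv dependency, a one-variable .env file like this can be loaded with a few lines of standard-library Python. This is an illustrative sketch, not the project's actual loader:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Read KEY=value lines from a .env file into os.environ.

    Blank lines and lines starting with '#' are skipped; variables
    already present in the environment are not overwritten.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the key is available to the rest of the pipeline:
# load_env_file()
# api_key = os.environ["GROQ_API_KEY"]
```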

Step 3: Verify Setup

Run the quick test to verify everything is connected:

cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py

Expected output:

✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!

Evaluation Execution

Option 1: Quick Test (5 samples per task)

For testing the complete pipeline:

.venv\Scripts\python.exe run_evaluation.py --max-samples 5

This will:

  • Test all 4 RAG abilities
  • Use the 3 default models
  • Take ~10-20 seconds per model
  • Output results to results/evaluation_results.json

Option 2: Medium Test (20 samples per task)

For validating results:

.venv\Scripts\python.exe run_evaluation.py --max-samples 20

This will:

  • Test all 4 RAG abilities
  • Use 20 samples per noise ratio (100 total across the 5 noise ratios for noise robustness)
  • Take 2-5 minutes per model
  • Generate reliable metrics

Option 3: Full Evaluation (300 samples per task)

For final results matching the paper:

.venv\Scripts\python.exe run_evaluation.py --max-samples 300

This will:

  • Use all available samples
  • Test at 5 noise ratios with full data
  • Take 10-30 minutes per model
  • Match the methodology behind the paper's Tables 1-7

Understanding the Results

Output File: results/evaluation_results.json

Example structure:

{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
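A results file with this shape can be summarized in a few lines of Python. The field names below mirror the example structure above; adjust them if your actual output differs:

```python
import json

def summarize_results(path: str) -> list[str]:
    """Return one formatted line per (task, model) entry in the results file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    lines = []
    for entry in data["results"]:
        # Each task type reports a different headline metric.
        if "accuracy" in entry:
            metric = f"accuracy {entry['accuracy']:.2f}%"
        elif "rejection_rate" in entry:
            metric = f"rejection rate {entry['rejection_rate']:.2f}%"
        else:
            metric = f"error detection {entry['error_detection_rate']:.2f}%"
        lines.append(f"{entry['task_type']:<28} {entry['model_name']:<28} {metric}")
    return lines
```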

Metric Definitions

Accuracy (%): Percentage of correct answers

  • Used for: Noise Robustness, Information Integration
  • Formula: (correct_answers / total_samples) × 100

Rejection Rate (%): Percentage of appropriate rejections

  • Used for: Negative Rejection
  • Formula: (rejected_responses / total_samples) × 100
  • Expected: ~95%+ for good models

Error Detection Rate (%): Percentage of factual errors detected

  • Used for: Counterfactual Robustness
  • Formula: (detected_errors / total_samples) × 100
  • Expected: 70%+ for good models

Error Correction Rate (%): Percentage of detected errors corrected

  • Used for: Counterfactual Robustness
  • Formula: (corrected_errors / detected_errors) × 100
  • Expected: 90%+ for good models
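The four metrics reduce to simple ratios. Here is a sketch following the formulas above (this is not the project's evaluator.py, whose answer-matching logic is more involved):

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy (%): used for noise robustness and information integration."""
    return 100.0 * correct / total if total else 0.0

def rejection_rate(rejected: int, total: int) -> float:
    """Rejection rate (%): used for negative rejection."""
    return 100.0 * rejected / total if total else 0.0

def error_detection_rate(detected: int, total: int) -> float:
    """Error detection rate (%): used for counterfactual robustness."""
    return 100.0 * detected / total if total else 0.0

def error_correction_rate(corrected: int, detected: int) -> float:
    """Error correction rate (%): corrected out of *detected* errors."""
    return 100.0 * corrected / detected if detected else 0.0
```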

Performance Notes

Groq Free Tier Limits

  • API rate limit: ~30 requests per minute
  • Response time: 100-500ms per request
  • Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b

Estimated Evaluation Times

Task                   Samples     Time per Model
Quick Test (5)         25 total    30-60 seconds
Medium Test (20)       100 total   2-5 minutes
Full Evaluation (300)  1500 total  15-30 minutes
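These estimates follow from the client's 0.5 s spacing between requests (see Troubleshooting) plus the 100-500 ms response times noted above. A back-of-the-envelope calculation (the default latency value here is an assumption, not a measured figure):

```python
def estimated_minutes(total_requests: int,
                      rate_limit_gap_s: float = 0.5,
                      avg_latency_s: float = 0.3) -> float:
    """Rough wall-clock estimate per model: each request waits out the
    rate-limit gap, then its own response latency."""
    return total_requests * (rate_limit_gap_s + avg_latency_s) / 60.0

# e.g. the full evaluation's 1500 requests come to roughly 20 minutes,
# consistent with the 15-30 minute range in the table above.
```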

Model Recommendations

  1. llama-3.3-70b-versatile (Best quality, slowest)

    • Best for: Most accurate evaluations
    • Speed: ~300ms per request
  2. llama-3.1-8b-instant (Fast, good quality)

    • Best for: Quick testing and validation
    • Speed: ~100ms per request
  3. mixtral-8x7b-32768 (Balanced)

    • Best for: Production evaluations
    • Speed: ~200ms per request

Troubleshooting

Issue: GROQ_API_KEY not found

Solution:

  • Create .env file with GROQ_API_KEY=your_key
  • Or set environment variable: $env:GROQ_API_KEY='your_key'

Issue: Rate limit errors

Solution:

  • LLM client has built-in rate limiting (0.5s between requests)
  • If still occurring, reduce --max-samples parameter
  • Use smaller models (llama-3.1-8b-instant) for testing
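The built-in rate limiting mentioned above amounts to enforcing a minimum gap between consecutive calls. A minimal sketch of that pattern (illustrative, not the actual llm_client.py code):

```python
import time

class MinIntervalLimiter:
    """Block until at least `interval` seconds have passed since the last call."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep off whatever remains of the interval since the previous call.
        remaining = self.interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call limiter.wait() immediately before each API request.
```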

Issue: Out of memory

Solution:

  • Run with smaller batch: --max-samples 50
  • Use faster model: llama-3.1-8b-instant

Issue: Type errors

Solution:

  • Verify Python 3.10+ is being used
  • Reinstall dependencies: pip install -r requirements.txt

Validation Against Paper

After running the evaluation, compare your results with Tables 1-7 of the RGB benchmark paper:

Table 1: Noise Robustness (Page 8)

  • Expected: Accuracy decreases as the noise ratio increases (0% → 80%)
  • Validate: Compare your accuracy curve with the paper's

Table 2: Negative Rejection (Page 8)

  • Expected: Rejection rate ~95%+ for good models
  • Validate: Check rejection_rate metric

Table 3: Information Integration (Page 9)

  • Expected: Accuracy 85%+ for good models
  • Validate: Compare integrated answer accuracy

Table 4: Counterfactual Robustness (Page 9)

  • Expected: Error detection 70%+, error correction 80%+
  • Validate: Check both rates in results

Next Steps

  1. ✅ Setup complete: Create .env with Groq API key
  2. ✅ Run quick test: python run_evaluation.py --max-samples 5
  3. ✅ Validate results match expected ranges
  4. ✅ Run full evaluation: python run_evaluation.py --max-samples 300
  5. ✅ Compare final results with the paper's Tables 1-7

Project Compliance Summary

✅ 4 RAG Abilities Implemented:

  • Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
  • Negative Rejection (exact phrase matching)
  • Information Integration (multi-document synthesis)
  • Counterfactual Robustness (error detection & correction)
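For a given noise ratio, the noise-robustness task mixes positive and noisy documents in the retrieval set. A hedged sketch of how that split might be computed (the function name and rounding choice are illustrative, not taken from the project code):

```python
def split_documents(total_docs: int, noise_ratio: float) -> tuple[int, int]:
    """Return (num_positive, num_noisy) for a retrieval set of total_docs
    at the given noise ratio (0.0 to 0.8 in this benchmark)."""
    num_noisy = round(total_docs * noise_ratio)  # rounding is an assumption
    return total_docs - num_noisy, num_noisy

# e.g. 5 retrieved documents at 40% noise -> 3 positive, 2 noisy
```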

✅ 3+ LLM Models Supported:

  • llama-3.3-70b-versatile (highest quality)
  • llama-3.1-8b-instant (fastest)
  • mixtral-8x7b-32768 (balanced)

✅ Exact Figure 3 Prompt Format:

  • System instruction specifying task behavior
  • Input format: "Document:\n{documents}\n\nQuestion: {question}"
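Given that input format, assembling the user prompt is a single string build. A sketch (joining multiple documents with a newline is an assumption here, not confirmed project behavior):

```python
def build_user_prompt(documents: list[str], question: str) -> str:
    """Join the retrieved documents and the question in the
    "Document:\n{documents}\n\nQuestion: {question}" input format."""
    docs_block = "\n".join(documents)  # separator choice is illustrative
    return f"Document:\n{docs_block}\n\nQuestion: {question}"
```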

✅ All Required Metrics:

  • Accuracy (for noise robustness, information integration)
  • Rejection rate (for negative rejection)
  • Error detection & correction rates (for counterfactual robustness)

✅ Type-Safe & Well-Tested:

  • No type errors or warnings
  • Comprehensive test suite included
  • All components validated

Status: Ready for evaluation on the Groq free API tier

For questions or issues, refer to:

  • COMPLIANCE_REVIEW_SUMMARY.md - Overview of changes
  • DETAILED_CHANGES.md - Specific code modifications
  • README.md - Original project documentation