RGB RAG Evaluation Pipeline - Setup & Execution Guide

Project Structure

d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py            # Groq API client with rate limiting
│   ├── data_loader.py           # RGB dataset loader (4 tasks)
│   ├── evaluator.py             # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py               # Figure 3 prompt templates
│   ├── pipeline.py              # Main evaluation pipeline
│   └── config.py                # Configuration constants
├── data/
│   ├── en_refine.json           # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json              # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json             # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                       # Python 3.13.1 virtual environment
├── requirements.txt             # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                 # Template for GROQ_API_KEY
├── quick_test.py                # Quick connectivity test
├── run_evaluation.py            # Main evaluation script
├── download_datasets.py         # Dataset download script (already executed)
├── test_refactored_pipeline.py  # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md # Review of all changes made
├── DETAILED_CHANGES.md          # Detailed code changes per file
└── README.md                    # Original project documentation

Setup Instructions

Step 1: Get Groq API Key

  1. Visit https://console.groq.com/keys
  2. Sign up (free tier available)
  3. Create an API key for your account
  4. Copy the key to your clipboard

Step 2: Create .env File

Create file: d:\CapStoneProject\RGB\.env

GROQ_API_KEY=your_api_key_here

Replace your_api_key_here with your actual Groq API key.
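If you would rather not rely on a separate dotenv dependency, a one-variable .env file like this can be loaded with a few lines of standard-library Python. This is an illustrative sketch, not the project's actual loader:

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Read KEY=value lines from a .env file into os.environ.

    Blank lines and lines starting with '#' are skipped; variables
    already present in the environment are not overwritten.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After loading, the key is available to the rest of the pipeline:
# load_env_file()
# api_key = os.environ["GROQ_API_KEY"]
```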

Step 3: Verify Setup

Run the quick test to verify everything is connected:

cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py

Expected output:

✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!

Evaluation Execution

Option 1: Quick Test (5 samples per task)

For testing the complete pipeline:

.venv\Scripts\python.exe run_evaluation.py --max-samples 5

This will:

  • Test all 4 RAG abilities
  • Use the 3 default models
  • Take ~10-20 seconds per model
  • Output results to results/evaluation_results.json

Option 2: Medium Test (20 samples per task)

For validating results:

.venv\Scripts\python.exe run_evaluation.py --max-samples 20

This will:

  • Test all 4 RAG abilities
  • Use 20 samples per noise ratio (100 total across the 5 noise ratios for noise robustness)
  • Take 2-5 minutes per model
  • Generate reliable metrics

Option 3: Full Evaluation (300 samples per task)

For final results matching the paper:

.venv\Scripts\python.exe run_evaluation.py --max-samples 300

This will:

  • Use all available samples
  • Test at 5 noise ratios with full data
  • Take 10-30 minutes per model
  • Match the methodology behind the paper's Tables 1-7

Understanding the Results

Output File: results/evaluation_results.json

Example structure:

{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
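A results file with this shape can be summarized in a few lines of Python. The field names below mirror the example structure above; adjust them if your actual output differs:

```python
import json

def summarize_results(path: str) -> list[str]:
    """Return one formatted line per (task, model) entry in the results file."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    lines = []
    for entry in data["results"]:
        # Each task type reports a different headline metric.
        if "accuracy" in entry:
            metric = f"accuracy {entry['accuracy']:.2f}%"
        elif "rejection_rate" in entry:
            metric = f"rejection rate {entry['rejection_rate']:.2f}%"
        else:
            metric = f"error detection {entry['error_detection_rate']:.2f}%"
        lines.append(f"{entry['task_type']:<28} {entry['model_name']:<28} {metric}")
    return lines
```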

Metric Definitions

Accuracy (%): Percentage of correct answers

  • Used for: Noise Robustness, Information Integration
  • Formula: (correct_answers / total_samples) × 100

Rejection Rate (%): Percentage of appropriate rejections

  • Used for: Negative Rejection
  • Formula: (rejected_responses / total_samples) × 100
  • Expected: ~95%+ for good models

Error Detection Rate (%): Percentage of factual errors detected

  • Used for: Counterfactual Robustness
  • Formula: (detected_errors / total_samples) × 100
  • Expected: 70%+ for good models

Error Correction Rate (%): Percentage of detected errors corrected

  • Used for: Counterfactual Robustness
  • Formula: (corrected_errors / detected_errors) × 100
  • Expected: 90%+ for good models
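The four metrics reduce to simple ratios. Here is a sketch following the formulas above (this is not the project's evaluator.py, whose answer-matching logic is more involved):

```python
def accuracy(correct: int, total: int) -> float:
    """Accuracy (%): used for noise robustness and information integration."""
    return 100.0 * correct / total if total else 0.0

def rejection_rate(rejected: int, total: int) -> float:
    """Rejection rate (%): used for negative rejection."""
    return 100.0 * rejected / total if total else 0.0

def error_detection_rate(detected: int, total: int) -> float:
    """Error detection rate (%): used for counterfactual robustness."""
    return 100.0 * detected / total if total else 0.0

def error_correction_rate(corrected: int, detected: int) -> float:
    """Error correction rate (%): corrected out of *detected* errors."""
    return 100.0 * corrected / detected if detected else 0.0
```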

Performance Notes

Groq Free Tier Limits

  • API rate limit: ~30 requests per minute
  • Response time: 100-500ms per request
  • Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b

Estimated Evaluation Times

Task                   Samples     Time per Model
Quick Test (5)         25 total    30-60 seconds
Medium Test (20)       100 total   2-5 minutes
Full Evaluation (300)  1500 total  15-30 minutes
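These estimates follow from the client's 0.5 s spacing between requests (see Troubleshooting) plus the 100-500 ms response times noted above. A back-of-the-envelope calculation (the default latency value here is an assumption, not a measured figure):

```python
def estimated_minutes(total_requests: int,
                      rate_limit_gap_s: float = 0.5,
                      avg_latency_s: float = 0.3) -> float:
    """Rough wall-clock estimate per model: each request waits out the
    rate-limit gap, then its own response latency."""
    return total_requests * (rate_limit_gap_s + avg_latency_s) / 60.0

# e.g. the full evaluation's 1500 requests come to roughly 20 minutes,
# consistent with the 15-30 minute range in the table above.
```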

Model Recommendations

  1. llama-3.3-70b-versatile (Best quality, slowest)

    • Best for: Most accurate evaluations
    • Speed: ~300ms per request
  2. llama-3.1-8b-instant (Fast, good quality)

    • Best for: Quick testing and validation
    • Speed: ~100ms per request
  3. mixtral-8x7b-32768 (Balanced)

    • Best for: Production evaluations
    • Speed: ~200ms per request

Troubleshooting

Issue: GROQ_API_KEY not found

Solution:

  • Create .env file with GROQ_API_KEY=your_key
  • Or set environment variable: $env:GROQ_API_KEY='your_key'

Issue: Rate limit errors

Solution:

  • LLM client has built-in rate limiting (0.5s between requests)
  • If still occurring, reduce --max-samples parameter
  • Use smaller models (llama-3.1-8b-instant) for testing
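The built-in rate limiting mentioned above amounts to enforcing a minimum gap between consecutive calls. A minimal sketch of that pattern (illustrative, not the actual llm_client.py code):

```python
import time

class MinIntervalLimiter:
    """Block until at least `interval` seconds have passed since the last call."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep off whatever remains of the interval since the previous call.
        remaining = self.interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call limiter.wait() immediately before each API request.
```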

Issue: Out of memory

Solution:

  • Run with smaller batch: --max-samples 50
  • Use faster model: llama-3.1-8b-instant

Issue: Type errors

Solution:

  • Verify Python 3.10+ is being used
  • Reinstall dependencies: pip install -r requirements.txt

Validation Against Paper

After running the evaluation, compare your results with Tables 1-7 of the RGB benchmark paper:

Table 1: Noise Robustness (Page 8)

  • Expected: Accuracy decreases as the noise ratio increases (0% → 80%)
  • Validate: Compare your accuracy curve with the paper's

Table 2: Negative Rejection (Page 8)

  • Expected: Rejection rate ~95%+ for good models
  • Validate: Check rejection_rate metric

Table 3: Information Integration (Page 9)

  • Expected: Accuracy 85%+ for good models
  • Validate: Compare integrated answer accuracy

Table 4: Counterfactual Robustness (Page 9)

  • Expected: Error detection 70%+, error correction 80%+
  • Validate: Check both rates in results

Next Steps

  1. ✅ Setup complete: Create .env with Groq API key
  2. ✅ Run quick test: python run_evaluation.py --max-samples 5
  3. ✅ Validate results match expected ranges
  4. ✅ Run full evaluation: python run_evaluation.py --max-samples 300
  5. ✅ Compare final results with the paper's Tables 1-7

Project Compliance Summary

✅ 4 RAG Abilities Implemented:

  • Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
  • Negative Rejection (exact phrase matching)
  • Information Integration (multi-document synthesis)
  • Counterfactual Robustness (error detection & correction)
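For a given noise ratio, the noise-robustness task mixes positive and noisy documents in the retrieval set. A hedged sketch of how that split might be computed (the function name and rounding choice are illustrative, not taken from the project code):

```python
def split_documents(total_docs: int, noise_ratio: float) -> tuple[int, int]:
    """Return (num_positive, num_noisy) for a retrieval set of total_docs
    at the given noise ratio (0.0 to 0.8 in this benchmark)."""
    num_noisy = round(total_docs * noise_ratio)  # rounding is an assumption
    return total_docs - num_noisy, num_noisy

# e.g. 5 retrieved documents at 40% noise -> 3 positive, 2 noisy
```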

✅ 3+ LLM Models Supported:

  • llama-3.3-70b-versatile (highest quality)
  • llama-3.1-8b-instant (fastest)
  • mixtral-8x7b-32768 (balanced)

✅ Exact Figure 3 Prompt Format:

  • System instruction specifying task behavior
  • Input format: "Document:\n{documents}\n\nQuestion: {question}"
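Given that input format, assembling the user prompt is a single string build. A sketch (joining multiple documents with a newline is an assumption here, not confirmed project behavior):

```python
def build_user_prompt(documents: list[str], question: str) -> str:
    """Join the retrieved documents and the question in the
    "Document:\n{documents}\n\nQuestion: {question}" input format."""
    docs_block = "\n".join(documents)  # separator choice is illustrative
    return f"Document:\n{docs_block}\n\nQuestion: {question}"
```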

✅ All Required Metrics:

  • Accuracy (for noise robustness, information integration)
  • Rejection rate (for negative rejection)
  • Error detection & correction rates (for counterfactual robustness)

✅ Type-Safe & Well-Tested:

  • No type errors or warnings
  • Comprehensive test suite included
  • All components validated

Status: Ready for evaluation on the Groq free API tier

For questions or issues, refer to:

  • COMPLIANCE_REVIEW_SUMMARY.md - Overview of changes
  • DETAILED_CHANGES.md - Specific code modifications
  • README.md - Original project documentation