RGB RAG Evaluation Pipeline - Setup & Execution Guide
Project Structure
d:\CapStoneProject\RGB\
├── src/
│   ├── __init__.py
│   ├── llm_client.py              # Groq API client with rate limiting
│   ├── data_loader.py             # RGB dataset loader (4 tasks)
│   ├── evaluator.py               # Metric calculations (accuracy, rejection, error rates)
│   ├── prompts.py                 # Figure 3 prompt templates
│   ├── pipeline.py                # Main evaluation pipeline
│   └── config.py                  # Configuration constants
├── data/
│   ├── en_refine.json             # Noise robustness + negative rejection data (7.2 MB, 300 samples)
│   ├── en_int.json                # Information integration data (5.1 MB, 100 samples)
│   └── en_fact.json               # Counterfactual robustness data (241 KB, 100 samples)
├── .venv/                         # Python 3.13.1 virtual environment
├── requirements.txt               # Dependencies (groq, tqdm, requests, pandas, pymupdf)
├── .env.example                   # Template for GROQ_API_KEY
├── quick_test.py                  # Quick connectivity test
├── run_evaluation.py              # Main evaluation script
├── download_datasets.py           # Dataset download script (already executed)
├── test_refactored_pipeline.py    # Comprehensive test suite
├── COMPLIANCE_REVIEW_SUMMARY.md   # Review of all changes made
├── DETAILED_CHANGES.md            # Detailed code changes per file
└── README.md                      # Original project documentation
Setup Instructions
Step 1: Get Groq API Key
- Visit https://console.groq.com/keys
- Sign up (free tier available)
- Create API key for your account
- Copy the key to your clipboard
Step 2: Create .env File
Create file: d:\CapStoneProject\RGB\.env
GROQ_API_KEY=your_api_key_here
Replace your_api_key_here with your actual Groq API key.
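As a sketch of how the key is picked up at runtime (the actual loading logic lives in the project's config; failing loudly on a missing key is an assumption here, not confirmed project behavior):

```python
import os

def load_groq_api_key() -> str:
    """Read GROQ_API_KEY from the environment, failing loudly if absent.

    Minimal sketch: the project's own config.py may additionally load
    the .env file (e.g. via python-dotenv) before this check runs.
    """
    key = os.environ.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY not set - create a .env file or export the variable"
        )
    return key
```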
Step 3: Verify Setup
Run quick test to verify everything is connected:
cd d:\CapStoneProject\RGB
.venv\Scripts\python.exe quick_test.py
Expected output:
✓ GROQ_API_KEY found
✓ Groq client initialized
✓ Testing LLM connectivity...
✓ Models available: 3
✓ Ready for evaluation!
Evaluation Execution
Option 1: Quick Test (5 samples per task)
For testing the complete pipeline:
.venv\Scripts\python.exe run_evaluation.py --max-samples 5
This will:
- Test all 4 RAG abilities
- Use the 3 default models
- Take ~30-60 seconds per model
- Output results to
results/evaluation_results.json
Option 2: Medium Test (20 samples per task)
For validating results:
.venv\Scripts\python.exe run_evaluation.py --max-samples 20
This will:
- Test all 4 RAG abilities
- Use 20 samples per noise ratio (100+ total for noise robustness)
- Take 2-5 minutes per model
- Generate reliable metrics
Option 3: Full Evaluation (300 samples per task)
For final results matching paper:
.venv\Scripts\python.exe run_evaluation.py --max-samples 300
This will:
- Use all available samples
- Test at 5 noise ratios with full data
- Take 15-30 minutes per model
- Match paper's Table 1-7 methodology
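The --max-samples flag used in all three options might be parsed like this (a hypothetical argparse sketch; the real run_evaluation.py may define additional flags):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the CLI surface used above; flag name and default
    # mirror the usage examples, the rest is assumed.
    parser = argparse.ArgumentParser(description="RGB RAG evaluation")
    parser.add_argument(
        "--max-samples",
        type=int,
        default=300,
        help="samples per task (5=quick, 20=medium, 300=full)",
    )
    return parser
```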
Understanding the Results
Output File: results/evaluation_results.json
Example structure:
{
  "timestamp": "2024-01-15 10:30:45",
  "models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
  "results": [
    {
      "task_type": "noise_robustness_0%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 285,
      "incorrect": 15,
      "accuracy": 95.00,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    {
      "task_type": "noise_robustness_20%",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "correct": 265,
      "incorrect": 35,
      "accuracy": 88.33,
      "rejected": 0,
      "rejection_rate": 0.00
    },
    // ... more results for each noise ratio and task
    {
      "task_type": "negative_rejection",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 300,
      "rejected": 285,
      "rejection_rate": 95.00
    },
    {
      "task_type": "counterfactual_robustness",
      "model_name": "llama-3.3-70b-versatile",
      "total_samples": 100,
      "error_detection_count": 85,
      "error_detection_rate": 85.00,
      "error_correction_count": 80,
      "error_correction_rate": 80.00
    }
  ]
}
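Once results are written, accuracies for one model can be tabulated with a few lines of standard-library code (assuming the structure shown above; field names are taken from the example):

```python
import json

def accuracy_by_task(results_json: str, model_name: str) -> dict:
    """Map task_type -> accuracy for one model, skipping tasks
    (like negative_rejection) that report no accuracy field."""
    data = json.loads(results_json)
    return {
        r["task_type"]: r["accuracy"]
        for r in data["results"]
        if r["model_name"] == model_name and "accuracy" in r
    }
```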
Metric Definitions
Accuracy (%): Percentage of correct answers
- Used for: Noise Robustness, Information Integration
- Formula: (correct_answers / total_samples) × 100
Rejection Rate (%): Percentage of appropriate rejections
- Used for: Negative Rejection
- Formula: (rejected_responses / total_samples) × 100
- Expected: ~95%+ for good models
Error Detection Rate (%): Percentage of factual errors detected
- Used for: Counterfactual Robustness
- Formula: (detected_errors / total_samples) × 100
- Expected: 70%+ for good models
Error Correction Rate (%): Percentage of samples where the factual error was corrected
- Used for: Counterfactual Robustness
- Formula: (corrected_errors / total_samples) × 100
- Expected: 80%+ for good models
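The formulas above can be written out directly, with percentages rounded to two decimals as in the example output (note: the error-correction denominator here follows the example JSON, i.e. total samples; the codebase may normalize by detected errors instead):

```python
def accuracy(correct: int, total: int) -> float:
    """Noise robustness / information integration metric."""
    return round(100.0 * correct / total, 2)

def rejection_rate(rejected: int, total: int) -> float:
    """Negative rejection metric."""
    return round(100.0 * rejected / total, 2)

def error_detection_rate(detected: int, total: int) -> float:
    """Counterfactual robustness: share of errors flagged."""
    return round(100.0 * detected / total, 2)

def error_correction_rate(corrected: int, total: int) -> float:
    """Counterfactual robustness: share of errors corrected.

    Denominator assumed to be total samples, matching the example
    results (80 corrected of 100 samples -> 80.00)."""
    return round(100.0 * corrected / total, 2)
```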
Performance Notes
Groq Free Tier Limits
- API rate limit: ~30 requests per minute
- Response time: 100-500ms per request
- Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b
Estimated Evaluation Times
| Task | Samples | Time per Model |
|---|---|---|
| Quick Test (5) | 25 total | 30-60 seconds |
| Medium Test (20) | 100 total | 2-5 minutes |
| Full Evaluation (300) | 1500 total | 15-30 minutes |
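These estimates roughly follow from the request spacing plus per-request latency quoted above (0.5 s throttle, 100-500 ms responses). A back-of-envelope estimator, purely illustrative:

```python
def estimated_minutes(requests: int, spacing_s: float = 0.5,
                      latency_s: float = 0.5) -> float:
    """Rough wall-clock estimate for sequential rate-limited requests.

    Defaults assume the 0.5 s built-in throttle plus ~0.5 s of
    response latency per request (upper end of the quoted range).
    """
    return requests * (spacing_s + latency_s) / 60.0
```

For example, 1500 requests at ~1 s each works out to 25 minutes, in line with the "15-30 minutes" figure for a full evaluation.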
Model Recommendations
llama-3.3-70b-versatile (Best quality, slowest)
- Best for: Most accurate evaluations
- Speed: ~300ms per request
llama-3.1-8b-instant (Fast, good quality)
- Best for: Quick testing and validation
- Speed: ~100ms per request
mixtral-8x7b-32768 (Balanced)
- Best for: Production evaluations
- Speed: ~200ms per request
Troubleshooting
Issue: GROQ_API_KEY not found
Solution:
- Create a .env file with GROQ_API_KEY=your_key
- Or set the environment variable in PowerShell: $env:GROQ_API_KEY='your_key'
Issue: Rate limit errors
Solution:
- LLM client has built-in rate limiting (0.5s between requests)
- If errors still occur, reduce the --max-samples parameter
- Use a smaller model (llama-3.1-8b-instant) for testing
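The built-in rate limiting presumably looks something like the following (a hedged sketch, not the project's actual llm_client.py):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive API calls."""

    def __init__(self, min_interval_s: float = 0.5):
        self.min_interval_s = min_interval_s
        self._last_call = 0.0  # monotonic timestamp of previous call

    def wait(self) -> None:
        """Sleep just long enough to keep calls min_interval_s apart."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last_call = time.monotonic()
```

Calling throttle.wait() before each request keeps traffic under ~120 requests/minute at the default 0.5 s interval, comfortably inside the quoted free-tier limit.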
Issue: Out of memory
Solution:
- Run with a smaller batch: --max-samples 50
- Use a faster model: llama-3.1-8b-instant
Issue: Type errors
Solution:
- Verify Python 3.10+ is being used
- Reinstall dependencies:
pip install -r requirements.txt
Validation Against Paper
After running evaluation, compare results with RGB benchmark paper Table 1-7:
Table 1: Noise Robustness (Page 8)
- Expected: Accuracy decreases as noise ratio increases (0% → 80%)
- Validate: Compare your accuracy curve with paper's
Table 2: Negative Rejection (Page 8)
- Expected: Rejection rate ~95%+ for good models
- Validate: Check rejection_rate metric
Table 3: Information Integration (Page 9)
- Expected: Accuracy 85%+ for good models
- Validate: Compare integrated answer accuracy
Table 4: Counterfactual Robustness (Page 9)
- Expected: Error detection 70%+, error correction 80%+
- Validate: Check both rates in results
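The expected ranges above can also be checked programmatically; thresholds below are copied from Tables 2-4 and should be treated as guidelines, not hard pass/fail criteria:

```python
# Minimum expected values for "good models", per the validation notes above.
EXPECTED_MINIMUMS = {
    "rejection_rate": 95.0,         # Table 2
    "integration_accuracy": 85.0,   # Table 3
    "error_detection_rate": 70.0,   # Table 4
    "error_correction_rate": 80.0,  # Table 4
}

def meets_expectation(metric: str, value: float) -> bool:
    """True if a measured metric meets the paper's expected minimum."""
    return value >= EXPECTED_MINIMUMS[metric]
```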
Next Steps
- ✓ Setup complete: Create .env with Groq API key
- ✓ Run quick test: python run_evaluation.py --max-samples 5
- ✓ Validate results match expected ranges
- ✓ Run full evaluation: python run_evaluation.py --max-samples 300
- ✓ Compare final results with paper's Table 1-7
Project Compliance Summary
✓ 4 RAG Abilities Implemented:
- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
- Negative Rejection (exact phrase matching)
- Information Integration (multi-document synthesis)
- Counterfactual Robustness (error detection & correction)
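For noise robustness, the noise ratio presumably controls how many of the retrieved documents are negative (noisy). A hypothetical sketch of the split; the project's data_loader.py may implement this differently:

```python
def split_documents(total_docs: int, noise_ratio: float) -> tuple[int, int]:
    """Return (positive, negative) document counts for a noise ratio.

    E.g. 5 retrieved documents at 40% noise -> 3 positive, 2 negative.
    """
    negative = round(total_docs * noise_ratio)
    return total_docs - negative, negative
```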
✓ 3+ LLM Models Supported:
- llama-3.3-70b-versatile (highest quality)
- llama-3.1-8b-instant (fastest)
- mixtral-8x7b-32768 (balanced)
✓ Exact Figure 3 Prompt Format:
- System instruction specifying task behavior
- Input format: "Document:\n{documents}\n\nQuestion: {question}"
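The input format above can be assembled as follows (a minimal sketch; prompts.py holds the actual templates, and joining documents with newlines is an assumption):

```python
def build_user_prompt(documents: list[str], question: str) -> str:
    """Assemble the Figure 3 input format from retrieved documents."""
    return "Document:\n{documents}\n\nQuestion: {question}".format(
        documents="\n".join(documents), question=question
    )
```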
✓ All Required Metrics:
- Accuracy (for noise robustness, information integration)
- Rejection rate (for negative rejection)
- Error detection & correction rates (for counterfactual robustness)
✓ Type-Safe & Well-Tested:
- No type errors or warnings
- Comprehensive test suite included
- All components validated
Status: Ready for evaluation with Groq free API tier
For questions or issues, refer to:
- COMPLIANCE_REVIEW_SUMMARY.md - Overview of changes
- DETAILED_CHANGES.md - Specific code modifications
- README.md - Original project documentation