Spaces:

gopikrishnait
/

RGBMetrics

Sleeping

App Files Files Community

RGBMetrics / SETUP_AND_EXECUTION_GUIDE.md

RGB Evaluation

feat: Add separate grid layout for 4 RAG abilities in Streamlit UI

af25c62 about 1 month ago

preview code

raw

history blame contribute delete

8.95 kB

	# RGB RAG Evaluation Pipeline - Setup & Execution Guide

	## Project Structure

	```
	d:\CapStoneProject\RGB\
	├── src/
	│ ├── __init__.py
	│ ├── llm_client.py # Groq API client with rate limiting
	│ ├── data_loader.py # RGB dataset loader (4 tasks)
	│ ├── evaluator.py # Metric calculations (accuracy, rejection, error rates)
	│ ├── prompts.py # Figure 3 prompt templates
	│ ├── pipeline.py # Main evaluation pipeline
	│ └── config.py # Configuration constants
	├── data/
	│ ├── en_refine.json # Noise robustness + negative rejection data (7.2 MB, 300 samples)
	│ ├── en_int.json # Information integration data (5.1 MB, 100 samples)
	│ └── en_fact.json # Counterfactual robustness data (241 KB, 100 samples)
	├── .venv/ # Python 3.13.1 virtual environment
	├── requirements.txt # Dependencies (groq, tqdm, requests, pandas, pymupdf)
	├── .env.example # Template for GROQ_API_KEY
	├── quick_test.py # Quick connectivity test
	├── run_evaluation.py # Main evaluation script
	├── download_datasets.py # Dataset download script (already executed)
	├── test_refactored_pipeline.py # Comprehensive test suite
	├── COMPLIANCE_REVIEW_SUMMARY.md # Review of all changes made
	├── DETAILED_CHANGES.md # Detailed code changes per file
	└── README.md # Original project documentation
	```

	## Setup Instructions

	### Step 1: Get Groq API Key
	1. Visit https://console.groq.com/keys
	2. Sign up (free tier available)
	3. Create API key for your account
	4. Copy the key to your clipboard

	### Step 2: Create .env File
	Create file: `d:\CapStoneProject\RGB\.env`

	```
	GROQ_API_KEY=your_api_key_here
	```

	Replace `your_api_key_here` with your actual Groq API key.

	### Step 3: Verify Setup
	Run quick test to verify everything is connected:

	```bash
	cd d:\CapStoneProject\RGB
	.venv\Scripts\python.exe quick_test.py
	```

	Expected output:
	```
	✓ GROQ_API_KEY found
	✓ Groq client initialized
	✓ Testing LLM connectivity...
	✓ Models available: 3
	✓ Ready for evaluation!
	```

	## Evaluation Execution

	### Option 1: Quick Test (5 samples per task)
	For testing the complete pipeline:

	```bash
	.venv\Scripts\python.exe run_evaluation.py --max-samples 5
	```

	This will:
	- Test all 4 RAG abilities
	- Use the 3 default models
	- Take ~10-20 seconds per model
	- Output results to `results/evaluation_results.json`

	### Option 2: Medium Test (20 samples per task)
	For validating results:

	```bash
	.venv\Scripts\python.exe run_evaluation.py --max-samples 20
	```

	This will:
	- Test all 4 RAG abilities
	- Use 20 samples per noise ratio (100+ total for noise robustness)
	- Take 2-5 minutes per model
	- Generate reliable metrics

	### Option 3: Full Evaluation (300 samples per task)
	For final results matching paper:

	```bash
	.venv\Scripts\python.exe run_evaluation.py --max-samples 300
	```

	This will:
	- Use all available samples
	- Test at 5 noise ratios with full data
	- Take 10-30 minutes per model
	- Match paper's Table 1-7 methodology

	## Understanding the Results

	### Output File: results/evaluation_results.json

	Example structure:
	```json
	{
	"timestamp": "2024-01-15 10:30:45",
	"models": ["llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"],
	"results": [
	{
	"task_type": "noise_robustness_0%",
	"model_name": "llama-3.3-70b-versatile",
	"total_samples": 300,
	"correct": 285,
	"incorrect": 15,
	"accuracy": 95.00,
	"rejected": 0,
	"rejection_rate": 0.00
	},
	{
	"task_type": "noise_robustness_20%",
	"model_name": "llama-3.3-70b-versatile",
	"total_samples": 300,
	"correct": 265,
	"incorrect": 35,
	"accuracy": 88.33,
	"rejected": 0,
	"rejection_rate": 0.00
	},
	// ... more results for each noise ratio and task
	{
	"task_type": "negative_rejection",
	"model_name": "llama-3.3-70b-versatile",
	"total_samples": 300,
	"rejected": 285,
	"rejection_rate": 95.00
	},
	{
	"task_type": "counterfactual_robustness",
	"model_name": "llama-3.3-70b-versatile",
	"total_samples": 100,
	"error_detection_count": 85,
	"error_detection_rate": 85.00,
	"error_correction_count": 80,
	"error_correction_rate": 80.00
	}
	]
	}
	```

	### Metric Definitions

	Accuracy (%): Percentage of correct answers
	- Used for: Noise Robustness, Information Integration
	- Formula: (correct_answers / total_samples) × 100

	Rejection Rate (%): Percentage of appropriate rejections
	- Used for: Negative Rejection
	- Formula: (rejected_responses / total_samples) × 100
	- Expected: ~95%+ for good models

	Error Detection Rate (%): Percentage of factual errors detected
	- Used for: Counterfactual Robustness
	- Formula: (detected_errors / total_samples) × 100
	- Expected: 70%+ for good models

	Error Correction Rate (%): Percentage of detected errors corrected
	- Used for: Counterfactual Robustness
	- Formula: (corrected_errors / detected_errors) × 100
	- Expected: 90%+ for good models

	## Performance Notes

	### Groq Free Tier Limits
	- API rate limit: ~30 requests per minute
	- Response time: 100-500ms per request
	- Models available: llama-3.3-70b, llama-3.1-8b-instant, mixtral-8x7b

	### Estimated Evaluation Times
	\| Task \| Samples \| Time per Model \|
	\|------\|---------\|----------------\|
	\| Quick Test (5) \| 25 total \| 30-60 seconds \|
	\| Medium Test (20) \| 100 total \| 2-5 minutes \|
	\| Full Evaluation (300) \| 1500 total \| 15-30 minutes \|

	### Model Recommendations
	1. llama-3.3-70b-versatile (Best quality, slowest)
	- Best for: Most accurate evaluations
	- Speed: ~300ms per request

	2. llama-3.1-8b-instant (Fast, good quality)
	- Best for: Quick testing and validation
	- Speed: ~100ms per request

	3. mixtral-8x7b-32768 (Balanced)
	- Best for: Production evaluations
	- Speed: ~200ms per request

	## Troubleshooting

	### Issue: GROQ_API_KEY not found
	Solution:
	- Create `.env` file with `GROQ_API_KEY=your_key`
	- Or set environment variable: `$env:GROQ_API_KEY='your_key'`

	### Issue: Rate limit errors
	Solution:
	- LLM client has built-in rate limiting (0.5s between requests)
	- If still occurring, reduce `--max-samples` parameter
	- Use smaller models (llama-3.1-8b-instant) for testing

	### Issue: Out of memory
	Solution:
	- Run with smaller batch: `--max-samples 50`
	- Use faster model: llama-3.1-8b-instant

	### Issue: Type errors
	Solution:
	- Verify Python 3.10+ is being used
	- Reinstall dependencies: `pip install -r requirements.txt`

	## Validation Against Paper

	After running evaluation, compare results with RGB benchmark paper Table 1-7:

	Table 1: Noise Robustness (Page 8)
	- Expected: Accuracy decreases as noise ratio increases (0% → 80%)
	- Validate: Compare your accuracy curve with paper's

	Table 2: Negative Rejection (Page 8)
	- Expected: Rejection rate ~95%+ for good models
	- Validate: Check rejection_rate metric

	Table 3: Information Integration (Page 9)
	- Expected: Accuracy 85%+ for good models
	- Validate: Compare integrated answer accuracy

	Table 4: Counterfactual Robustness (Page 9)
	- Expected: Error detection 70%+, error correction 80%+
	- Validate: Check both rates in results

	## Next Steps

	1. ✅ Setup complete: Create `.env` with Groq API key
	2. ✅ Run quick test: `python run_evaluation.py --max-samples 5`
	3. ✅ Validate results match expected ranges
	4. ✅ Run full evaluation: `python run_evaluation.py --max-samples 300`
	5. ✅ Compare final results with paper's Table 1-7

	## Project Compliance Summary

	✅ 4 RAG Abilities Implemented:
	- Noise Robustness (5 noise ratios: 0%, 20%, 40%, 60%, 80%)
	- Negative Rejection (exact phrase matching)
	- Information Integration (multi-document synthesis)
	- Counterfactual Robustness (error detection & correction)

	✅ 3+ LLM Models Supported:
	- llama-3.3-70b-versatile (highest quality)
	- llama-3.1-8b-instant (fastest)
	- mixtral-8x7b-32768 (balanced)

	✅ Exact Figure 3 Prompt Format:
	- System instruction specifying task behavior
	- Input format: "Document:\n{documents}\n\nQuestion: {question}"

	✅ All Required Metrics:
	- Accuracy (for noise robustness, information integration)
	- Rejection rate (for negative rejection)
	- Error detection & correction rates (for counterfactual robustness)

	✅ Type-Safe & Well-Tested:
	- No type errors or warnings
	- Comprehensive test suite included
	- All components validated

	---

	Status: Ready for evaluation with Groq free API tier

	For questions or issues, refer to:
	- `COMPLIANCE_REVIEW_SUMMARY.md` - Overview of changes
	- `DETAILED_CHANGES.md` - Specific code modifications
	- `README.md` - Original project documentation