---
title: RGB RAG Evaluation Dashboard
emoji: πŸ“Š
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---
# RGB RAG Evaluation Project
A Python project for evaluating the Retrieval-Augmented Generation (RAG) abilities of large language models on the RGB benchmark dataset, using Groq's free LLM API.
## Project Overview
This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):
1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to reject answering when documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
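To make the noise-robustness setup concrete, here is a minimal sketch of how a test case can be assembled. It assumes each RGB sample carries `positive` (relevant) and `negative` (noisy) document lists, as in the benchmark's JSON files; the helper name and exact mixing logic are illustrative, not taken from this project's source.

```python
import random

def build_context(sample: dict, noise_ratio: float, total_docs: int = 5) -> list[str]:
    """Mix relevant and noisy documents for a noise-robustness test case.

    Assumes the sample has 'positive' (relevant) and 'negative' (noisy)
    document lists, as in the RGB JSON files.
    """
    num_noise = int(total_docs * noise_ratio)
    docs = sample["positive"][: total_docs - num_noise] + sample["negative"][:num_noise]
    random.shuffle(docs)  # shuffle so the answer position gives nothing away
    return docs

sample = {
    "query": "Who wrote Hamlet?",
    "positive": ["Hamlet is a tragedy written by William Shakespeare."],
    "negative": ["Macbeth premiered in 1606.", "The Globe Theatre burned in 1613."],
}
docs = build_context(sample, noise_ratio=0.5, total_docs=2)  # 1 relevant + 1 noisy doc
```

Raising `noise_ratio` toward 1.0 is what stresses the model: at 1.0 no relevant document remains, which is exactly the negative-rejection setting.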
## Requirements
- Python 3.9+
- Groq API Key (free at https://console.groq.com/)
## Installation
1. **Clone and setup environment:**
```bash
cd RGB
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Set up Groq API Key:**
```bash
# Copy the example env file
copy .env.example .env    # Windows
# cp .env.example .env    # Linux/Mac
# Edit .env and add your Groq API key
# Get free key at: https://console.groq.com/
```
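For reference, loading the key from `.env` needs nothing more than a few lines of stdlib code. This is a hedged sketch of what `src/config.py` might do (the actual project may use the `python-dotenv` package instead; the `load_env` helper below is illustrative):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines, skipping blanks and '#' comments.

    Existing environment variables are not overridden (setdefault), so a key
    exported in the shell wins over the file.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("GROQ_API_KEY")  # None if neither .env nor the shell sets it
```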
4. **Download datasets:**
```bash
python download_datasets.py
```
## Project Structure
```
RGB/
β”œβ”€β”€ data/                    # Dataset files (downloaded)
β”‚   β”œβ”€β”€ en_refine.json       # Noise robustness & negative rejection
β”‚   β”œβ”€β”€ en_int.json          # Information integration
β”‚   └── en_fact.json         # Counterfactual robustness
β”œβ”€β”€ results/                 # Evaluation results
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py            # Configuration settings
β”‚   β”œβ”€β”€ llm_client.py        # Groq LLM client
β”‚   β”œβ”€β”€ data_loader.py       # RGB dataset loader
β”‚   β”œβ”€β”€ evaluator.py         # Evaluation metrics
β”‚   β”œβ”€β”€ prompts.py           # Prompt templates
β”‚   └── pipeline.py          # Main evaluation pipeline
β”œβ”€β”€ download_datasets.py     # Dataset downloader
β”œβ”€β”€ run_evaluation.py        # Main entry point
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ .env.example             # Environment variables template
└── README.md                # This file
```
## Usage
### Run Full Evaluation
```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py
# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10
# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection
# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```
### Command Line Options
| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
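The options in the table map onto a standard `argparse` setup. The following is a sketch mirroring the table, not the actual contents of `run_evaluation.py` (flag names come from the table; defaults and `nargs` choices are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the options table above."""
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data",
                   help="Directory containing RGB datasets")
    p.add_argument("-o", "--output-dir", default="results",
                   help="Directory to save results")
    p.add_argument("-m", "--models", nargs="+",
                   help="Space-separated list of models to evaluate")
    p.add_argument("-n", "--max-samples", type=int,
                   help="Maximum samples per task (for testing)")
    p.add_argument("-t", "--tasks", nargs="+",
                   help="Specific tasks to run")
    return p

args = build_parser().parse_args(["-n", "10", "-t", "noise_robustness"])
```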
### Download Datasets
```bash
# Download all datasets
python download_datasets.py
# Force re-download
python download_datasets.py --force
# Verify existing datasets
python download_datasets.py --verify
```
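The `--verify` pass can be approximated with a small stdlib check: each expected file must exist and parse as JSON lines (the RGB repo ships one JSON object per line). This sketch assumes that file layout; the real `download_datasets.py` may validate differently:

```python
import json
from pathlib import Path

EXPECTED_FILES = ["en_refine.json", "en_int.json", "en_fact.json"]

def verify_datasets(data_dir: str = "data") -> dict:
    """Return {filename: True/False} for each expected RGB dataset file.

    A file passes if it exists and every non-blank line parses as a JSON object.
    """
    status = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        try:
            with path.open(encoding="utf-8") as f:
                status[name] = all(
                    isinstance(json.loads(line), dict)
                    for line in f if line.strip()
                )
        except (OSError, json.JSONDecodeError):
            status[name] = False  # missing or malformed file
    return status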
## Evaluated Models
The project uses Groq's free LLM API and evaluates the following three models by default:
1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)
Additional available models:
- `gemma2-9b-it` - Google's Gemma 2 9B
## Metrics
### Noise Robustness
- **Accuracy**: Percentage of correctly answered questions with noisy documents
- Breakdown by noise level (0-4 noise documents)
### Negative Rejection
- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists
### Information Integration
- **Accuracy**: Percentage of correctly answered questions that require combining information from multiple documents
### Counterfactual Robustness
- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of factual errors for which the model produced the correct answer
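The two most common metrics above reduce to simple ratio computations. Here is a hedged sketch of how `src/evaluator.py` might score them; the containment-based matching rule and the rejection marker string are assumptions (the actual rule lives in the project's prompts and evaluator):

```python
def accuracy(results: list) -> float:
    """Fraction of samples whose prediction contains the gold answer.

    Assumes each result dict has 'gold' and 'prediction' string fields and
    uses case-insensitive substring matching as the correctness rule.
    """
    if not results:
        return 0.0
    correct = sum(r["gold"].lower() in r["prediction"].lower() for r in results)
    return correct / len(results)

def rejection_rate(results: list, marker: str = "I can not answer") -> float:
    """Fraction of unanswerable cases where the model emitted the rejection marker."""
    if not results:
        return 0.0
    rejected = sum(marker.lower() in r["prediction"].lower() for r in results)
    return rejected / len(results)
```

Error detection and correction rates follow the same pattern, just with a detection marker (e.g. a phrase flagging factual errors in the documents) and the corrected answer as the targets.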
## Output
Results are saved in the `results/` directory:
- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format
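Producing those two timestamped files takes only the stdlib. This is a sketch under the naming scheme above; the `save_results` helper and its argument shapes are assumptions, not the project's actual API:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_results(results: dict, summary_rows: list, output_dir: str = "results"):
    """Write full results as JSON and a summary table as CSV.

    File names follow the results_YYYYMMDD_HHMMSS.json /
    summary_YYYYMMDD_HHMMSS.csv pattern. summary_rows is a list of dicts
    sharing the same keys (one row per model/task).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    json_path = out / f"results_{stamp}.json"
    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    csv_path = out / f"summary_{stamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary_rows[0].keys()))
        writer.writeheader()
        writer.writerows(summary_rows)
    return json_path, csv_path
```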
Example output:
```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================
--- NOISE ROBUSTNESS ---
Model Accuracy Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile 85.50% N0:92.0% | N1:88.0% | N2:84.0%
--- NEGATIVE REJECTION ---
Model Rejection Rate Samples
------------------------------------------------------------
llama-3.3-70b-versatile 72.50% 100
--- INFORMATION INTEGRATION ---
Model Accuracy Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile 78.00% 78/100
--- COUNTERFACTUAL ROBUSTNESS ---
Model Error Det. Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile 65.00% 52.00%
```
## References
- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)
## License
This project is for educational purposes as part of a capstone project.