---
title: RGB RAG Evaluation Dashboard
emoji: π
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---
# RGB RAG Evaluation Project
A Python project for evaluating the Retrieval-Augmented Generation (RAG) abilities of LLMs on the RGB benchmark dataset, using Groq's free LLM API.
## Project Overview
This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):
1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to reject answering when documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
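All four abilities are probed with the same basic pattern: assemble retrieved documents and a question into one prompt, query the model, and inspect the response. A minimal sketch of that pattern (function names and the rejection phrase are illustrative, not the project's actual API):

```python
def build_rag_prompt(documents, question):
    """Assemble retrieved documents and a question into a single RAG prompt."""
    context = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    return (
        "Answer the question using only the documents below. "
        "If they do not contain the answer, reply 'I can not answer the question'.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def is_rejection(answer):
    """Negative-rejection check: did the model decline to answer?"""
    lowered = answer.lower()
    return "can not answer" in lowered or "cannot answer" in lowered
```

Noise robustness and information integration vary *which* documents go into the prompt; negative rejection and counterfactual robustness vary *how* the response is judged.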
## Requirements
- Python 3.9+
- Groq API Key (free at https://console.groq.com/)
## Installation
1. **Clone the repository and set up the environment:**
```bash
cd d:\CapStoneProject\RGB
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Set up Groq API Key:**
```bash
# Copy example env file
copy .env.example .env
# Edit .env and add your Groq API key
# Get free key at: https://console.groq.com/
```
4. **Download datasets:**
```bash
python download_datasets.py
```
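Inside the code, the API key is read from the environment. A minimal, dependency-free sketch of that lookup (the project may use `python-dotenv` instead; this hand-rolled `.env` parser is an assumption):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader; existing environment variables take precedence."""
    env_file = Path(path)
    if env_file.exists():
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

def get_api_key():
    """Return the Groq API key, failing loudly if it is not configured."""
    load_env()
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY not set; copy .env.example to .env first")
    return key
```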
## Project Structure
```
RGB/
├── data/                   # Dataset files (downloaded)
│   ├── en_refine.json      # Noise robustness & negative rejection
│   ├── en_int.json         # Information integration
│   └── en_fact.json        # Counterfactual robustness
├── results/                # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py           # Configuration settings
│   ├── llm_client.py       # Groq LLM client
│   ├── data_loader.py      # RGB dataset loader
│   ├── evaluator.py        # Evaluation metrics
│   ├── prompts.py          # Prompt templates
│   └── pipeline.py         # Main evaluation pipeline
├── download_datasets.py    # Dataset downloader
├── run_evaluation.py       # Main entry point
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
└── README.md               # This file
```
## Usage
### Run Full Evaluation
```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py
# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10
# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection
# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```
### Command Line Options
| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
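With `argparse`, the options table above maps onto a parser roughly like this (defaults for `--models` and `--tasks` are assumptions; check `run_evaluation.py` for the real ones):

```python
import argparse

def parse_args(argv=None):
    """CLI sketch mirroring the options table above."""
    parser = argparse.ArgumentParser(description="RGB RAG evaluation")
    parser.add_argument("-d", "--data-dir", default="data",
                        help="Directory containing RGB datasets")
    parser.add_argument("-o", "--output-dir", default="results",
                        help="Directory to save results")
    parser.add_argument("-m", "--models", nargs="+",
                        default=["llama-3.3-70b-versatile"],
                        help="Space-separated list of models to evaluate")
    parser.add_argument("-n", "--max-samples", type=int, default=None,
                        help="Maximum samples per task (for testing)")
    parser.add_argument("-t", "--tasks", nargs="+", default=None,
                        help="Specific tasks to run")
    return parser.parse_args(argv)
```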
### Download Datasets
```bash
# Download all datasets
python download_datasets.py
# Force re-download
python download_datasets.py --force
# Verify existing datasets
python download_datasets.py --verify
```
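The `--verify` flag boils down to checking that the three expected dataset files exist. A sketch of that check (the function name is illustrative; see `download_datasets.py` for the actual implementation):

```python
from pathlib import Path

EXPECTED_FILES = ["en_refine.json", "en_int.json", "en_fact.json"]

def missing_datasets(data_dir="data"):
    """Return the names of expected RGB dataset files that are not present."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```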
## Evaluated Models
The project uses Groq's free LLM API with the following models (at least 3 as required):
1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)
Additional available models:
- `gemma2-9b-it` - Google's Gemma 2 9B
## Metrics
### Noise Robustness
- **Accuracy**: Percentage of correctly answered questions with noisy documents
- Breakdown by noise level (0-4 noise documents)
### Negative Rejection
- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists
### Information Integration
- **Accuracy**: Percentage of correctly answered questions requiring info from multiple docs
### Counterfactual Robustness
- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of errors correctly corrected
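The accuracy and rejection-rate metrics above reduce to simple counting over model responses. A sketch, assuming answers are matched by case-insensitive substring (the RGB paper's convention; `src/evaluator.py` may differ in details):

```python
def accuracy(predictions, expected_answers):
    """Fraction of predictions containing at least one acceptable answer string."""
    if not predictions:
        return 0.0
    correct = sum(
        any(ans.lower() in pred.lower() for ans in answers)
        for pred, answers in zip(predictions, expected_answers)
    )
    return correct / len(predictions)

def rejection_rate(predictions, rejection_phrase="can not answer"):
    """Fraction of responses that decline to answer (negative rejection)."""
    if not predictions:
        return 0.0
    rejected = sum(rejection_phrase in p.lower() for p in predictions)
    return rejected / len(predictions)
```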
## Output
Results are saved in the `results/` directory:
- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format
Example output:
```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================
--- NOISE ROBUSTNESS ---
Model Accuracy Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile 85.50% N0:92.0% | N1:88.0% | N2:84.0%
--- NEGATIVE REJECTION ---
Model Rejection Rate Samples
------------------------------------------------------------
llama-3.3-70b-versatile 72.50% 100
--- INFORMATION INTEGRATION ---
Model Accuracy Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile 78.00% 78/100
--- COUNTERFACTUAL ROBUSTNESS ---
Model Error Det. Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile 65.00% 52.00%
```
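Because the result files are timestamped, the latest run can be picked up by sorting on filename. A small helper for post-hoc analysis (the JSON schema itself is not shown here and should be inspected in an actual results file):

```python
import json
from pathlib import Path

def latest_results(results_dir="results"):
    """Load the most recent results_*.json file, or None if there is none."""
    files = sorted(Path(results_dir).glob("results_*.json"))
    if not files:
        return None
    # Timestamped names (results_YYYYMMDD_HHMMSS.json) sort chronologically.
    return json.loads(files[-1].read_text())
```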
## References
- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)
## License
This project is for educational purposes as part of a capstone project.