---
title: RGB RAG Evaluation Dashboard
emoji: 📊
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---

# RGB RAG Evaluation Project

A Python project for evaluating the Retrieval-Augmented Generation (RAG) abilities of large language models on the RGB benchmark dataset, using Groq's free LLM API.

## Project Overview

This project evaluates the four key RAG abilities defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):

1. **Noise Robustness**: ability to handle noisy/irrelevant documents
2. **Negative Rejection**: ability to decline to answer when the documents don't contain the answer
3. **Information Integration**: ability to combine information from multiple documents
4. **Counterfactual Robustness**: ability to detect and correct factual errors in documents

## Requirements

- Python 3.9+
- Groq API key (free at https://console.groq.com/)

## Installation

1. **Clone and set up the environment:**

   ```bash
   cd d:\CapStoneProject\RGB
   python -m venv .venv
   .venv\Scripts\activate       # Windows
   # source .venv/bin/activate  # Linux/Mac
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up the Groq API key:**

   ```bash
   # Copy the example env file
   copy .env.example .env

   # Edit .env and add your Groq API key
   # Get a free key at: https://console.groq.com/
   ```

4. **Download the datasets:**

   ```bash
   python download_datasets.py
   ```

## Project Structure

```
RGB/
├── data/                    # Dataset files (downloaded)
│   ├── en_refine.json       # Noise robustness & negative rejection
│   ├── en_int.json          # Information integration
│   └── en_fact.json         # Counterfactual robustness
├── results/                 # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py            # Configuration settings
│   ├── llm_client.py        # Groq LLM client
│   ├── data_loader.py       # RGB dataset loader
│   ├── evaluator.py         # Evaluation metrics
│   ├── prompts.py           # Prompt templates
│   └── pipeline.py          # Main evaluation pipeline
├── download_datasets.py     # Dataset downloader
├── run_evaluation.py        # Main entry point
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
└── README.md                # This file
```

## Usage

### Run Full Evaluation

```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Limit the number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```

### Command Line Options

| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing the RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |

### Download Datasets

```bash
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```

## Evaluated Models

The project uses Groq's free LLM API with the following models (at least three, as required):

1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)

Additional available models:

- `gemma2-9b-it` - Google's Gemma 2 9B

## Metrics

### Noise Robustness

- **Accuracy**: percentage of questions answered correctly despite noisy documents
- Breakdown by noise level (0-4 noise documents)

### Negative Rejection

- **Rejection Rate**: percentage of questions correctly rejected when no answer exists

### Information Integration

- **Accuracy**: percentage of questions answered correctly when the answer requires information from multiple documents

### Counterfactual Robustness

- **Error Detection Rate**: percentage of factual errors detected
- **Error Correction Rate**: percentage of detected errors correctly fixed

## Output

Results are saved in the `results/` directory:

- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format

Example output:

```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================

--- NOISE ROBUSTNESS ---
Model                        Accuracy    Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile      85.50%      N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                        Rejection Rate    Samples
------------------------------------------------------------
llama-3.3-70b-versatile      72.50%            100

--- INFORMATION INTEGRATION ---
Model                        Accuracy    Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile      78.00%      78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                        Error Det.    Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile      65.00%        52.00%
```

## References

- Research paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)

## License

This project is for educational purposes as part of a capstone project.
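
## Appendix: Metric Sketch

To make the Rejection Rate metric concrete, here is a minimal, self-contained sketch of how it could be computed. The refusal phrases and function names below are hypothetical illustrations, not the project's actual implementation (which lives in `src/evaluator.py`):

```python
# Minimal sketch of the Rejection Rate metric: the percentage of
# unanswerable questions for which the model correctly declined to answer.
# REJECTION_PHRASES and both function names are illustrative only.

REJECTION_PHRASES = (
    "cannot answer",
    "insufficient information",
)

def is_rejection(answer: str) -> bool:
    """Treat an answer as a rejection if it contains a known refusal phrase."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in REJECTION_PHRASES)

def rejection_rate(answers: list[str]) -> float:
    """Percentage of answers (to unanswerable questions) that were rejections."""
    if not answers:
        return 0.0
    rejected = sum(is_rejection(a) for a in answers)
    return 100.0 * rejected / len(answers)

# Example: two of four answers refuse, so the rejection rate is 50%.
answers = [
    "I cannot answer based on the provided documents.",
    "The capital is Paris.",
    "There is insufficient information to answer.",
    "The answer is 42.",
]
print(f"{rejection_rate(answers):.2f}%")  # 50.00%
```

The accuracy metrics for noise robustness and information integration follow the same shape: score each sample 0 or 1, then average over the task's samples.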