---
title: RGB RAG Evaluation Dashboard
emoji: πŸ“Š
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---

# RGB RAG Evaluation Project

A Python project that evaluates the Retrieval-Augmented Generation (RAG) abilities of LLMs on the RGB benchmark dataset, using Groq's free LLM API.

## Project Overview

This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):

1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to reject answering when documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
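
For noise robustness, each test case mixes relevant documents with distractors at a chosen noise level. The sketch below illustrates how such a context might be assembled; the `positive`/`negative` field names follow the public RGB data files but are an assumption of this sketch, not a description of this project's loader:

```python
import random

def build_context(item, noise_count, total_docs=5, seed=0):
    """Mix relevant ('positive') and irrelevant ('negative') documents for one
    RGB item. Field names are assumed; check the downloaded JSON files."""
    rng = random.Random(seed)
    relevant = item["positive"][: total_docs - noise_count]
    noise = item["negative"][:noise_count]
    docs = relevant + noise
    rng.shuffle(docs)  # shuffle so the model cannot rely on document order
    return docs

item = {
    "query": "Who won the award?",
    "positive": ["Doc A (contains the answer)", "Doc B (contains the answer)"],
    "negative": ["Doc C (irrelevant)", "Doc D (irrelevant)", "Doc E (irrelevant)"],
}
docs = build_context(item, noise_count=2, total_docs=4)
print(len(docs))  # 4 (2 relevant + 2 noisy)
```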

## Requirements

- Python 3.9+
- Groq API Key (free at https://console.groq.com/)

## Installation

1. **Clone and setup environment:**
   ```bash
   cd path\to\RGB
   python -m venv .venv
   .venv\Scripts\activate  # Windows
   # source .venv/bin/activate  # Linux/Mac
   ```

2. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up Groq API Key:**
   ```bash
   # Copy example env file (Windows; use `cp .env.example .env` on Linux/Mac)
   copy .env.example .env
   
   # Edit .env and add your Groq API key
   # Get free key at: https://console.groq.com/
   ```

4. **Download datasets:**
   ```bash
   python download_datasets.py
   ```

## Project Structure

```
RGB/
β”œβ”€β”€ data/                       # Dataset files (downloaded)
β”‚   β”œβ”€β”€ en_refine.json         # Noise robustness & negative rejection
β”‚   β”œβ”€β”€ en_int.json            # Information integration
β”‚   └── en_fact.json           # Counterfactual robustness
β”œβ”€β”€ results/                    # Evaluation results
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py              # Configuration settings
β”‚   β”œβ”€β”€ llm_client.py          # Groq LLM client
β”‚   β”œβ”€β”€ data_loader.py         # RGB dataset loader
β”‚   β”œβ”€β”€ evaluator.py           # Evaluation metrics
β”‚   β”œβ”€β”€ prompts.py             # Prompt templates
β”‚   └── pipeline.py            # Main evaluation pipeline
β”œβ”€β”€ download_datasets.py        # Dataset downloader
β”œβ”€β”€ run_evaluation.py          # Main entry point
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ .env.example               # Environment variables template
└── README.md                  # This file
```
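
As a rough sketch of what `src/llm_client.py` might contain (the prompt wording, helper names, and model choice here are illustrative assumptions; only the `groq` SDK calls follow the official client API):

```python
import os

def build_messages(question, documents):
    """Assemble a chat prompt from a question and its retrieved documents."""
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    system = ("Answer the question using only the provided documents. "
              "If they do not contain the answer, say so.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]

def ask_groq(model, question, documents):
    """Send one evaluation query to Groq; requires GROQ_API_KEY in the env."""
    from groq import Groq  # pip install groq
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(question, documents),
        temperature=0.0,  # deterministic output for reproducible evaluation
    )
    return resp.choices[0].message.content
```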

## Usage

### Run Full Evaluation

```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```

### Command Line Options

| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
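
The table above maps naturally onto an `argparse` definition. This is a sketch of how `run_evaluation.py` might declare these options, not its actual source:

```python
import argparse

def make_parser():
    """Parser mirroring the documented CLI options (defaults assumed)."""
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data",
                   help="Directory containing RGB datasets")
    p.add_argument("-o", "--output-dir", default="results",
                   help="Directory to save results")
    p.add_argument("-m", "--models", nargs="+",
                   help="Space-separated list of models to evaluate")
    p.add_argument("-n", "--max-samples", type=int,
                   help="Maximum samples per task (for testing)")
    p.add_argument("-t", "--tasks", nargs="+",
                   help="Specific tasks to run")
    return p

args = make_parser().parse_args(["--max-samples", "10", "-t", "noise_robustness"])
print(args.max_samples, args.tasks)  # 10 ['noise_robustness']
```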

### Download Datasets

```bash
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```

## Evaluated Models

The project uses Groq's free LLM API and evaluates the following models (three by default):

1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)

Additional available models:
- `gemma2-9b-it` - Google's Gemma 2 9B

## Metrics

### Noise Robustness
- **Accuracy**: Percentage of correctly answered questions with noisy documents
- Breakdown by noise level (0-4 noise documents)

### Negative Rejection
- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists

### Information Integration
- **Accuracy**: Percentage of correctly answered questions that require combining information from multiple documents

### Counterfactual Robustness
- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of factual errors for which the correct answer is given
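
The first three metrics reduce to simple ratios. A minimal sketch (the substring-based rejection check is an assumption; the actual evaluator may match a fixed rejection phrase differently):

```python
def accuracy(results):
    """Fraction of items judged correct; `results` is a list of booleans."""
    return sum(results) / len(results) if results else 0.0

def rejection_rate(responses, rejection_phrase="cannot answer"):
    """Share of responses that refuse to answer, detected by a
    case-insensitive substring match (an assumption of this sketch)."""
    if not responses:
        return 0.0
    return sum(rejection_phrase in r.lower() for r in responses) / len(responses)

print(f"{accuracy([True, True, False, True]):.2%}")  # 75.00%
```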

## Output

Results are saved in the `results/` directory:
- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format

Example output:
```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================

--- NOISE ROBUSTNESS ---
Model                          Accuracy   Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile         85.50%   N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                          Rejection Rate    Samples
------------------------------------------------------------
llama-3.3-70b-versatile             72.50%         100

--- INFORMATION INTEGRATION ---
Model                          Accuracy   Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile         78.00%    78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                          Error Det.   Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile            65.00%       52.00%
```

## References

- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)

## License

This project is for educational purposes as part of a capstone project.