---
title: RGB RAG Evaluation Dashboard
emoji: π
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
---
# RGB RAG Evaluation Project

A Python project for evaluating LLM abilities for Retrieval-Augmented Generation (RAG) using the RGB benchmark dataset with Groq's free LLM API.
## Project Overview

This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):
- Noise Robustness: Ability to handle noisy/irrelevant documents
- Negative Rejection: Ability to reject answering when documents don't contain the answer
- Information Integration: Ability to combine information from multiple documents
- Counterfactual Robustness: Ability to detect and correct factual errors in documents
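As an illustration of the first ability, a noise-robustness test case can be assembled by mixing answer-bearing documents with distractors at a chosen noise level. The helper below is a hypothetical sketch (`build_context` is not part of this repo), not the pipeline's actual implementation:

```python
import random

def build_context(positive_docs, negative_docs, total_docs=5, noise_count=2, seed=0):
    """Mix relevant and noisy documents for a noise-robustness test case.

    positive_docs: documents that contain the answer
    negative_docs: irrelevant/noisy distractor documents
    noise_count:   how many of the total slots are filled with noise
    """
    rng = random.Random(seed)
    docs = positive_docs[: total_docs - noise_count] + negative_docs[:noise_count]
    rng.shuffle(docs)  # shuffle so document position gives no hint
    return docs

context = build_context(
    ["relevant A", "relevant B", "relevant C"],
    ["noise X", "noise Y"],
    total_docs=5, noise_count=2,
)
```

Sweeping `noise_count` from 0 to 4 yields the per-noise-level breakdown reported in the metrics below.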
## Requirements

- Python 3.9+
- A Groq API key (free at https://console.groq.com/)
## Installation

1. Clone the project and set up the environment:

   ```bash
   cd d:\CapStoneProject\RGB
   python -m venv .venv
   .venv\Scripts\activate        # Windows
   # source .venv/bin/activate   # Linux/Mac
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your Groq API key:

   ```bash
   # Copy the example env file
   copy .env.example .env
   # Edit .env and add your Groq API key
   # Get a free key at: https://console.groq.com/
   ```

4. Download the datasets:

   ```bash
   python download_datasets.py
   ```
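After installation, you can sanity-check that the key is visible to Python. This is a minimal sketch using only the standard library; the project's actual key-loading code lives in `src/config.py` and is not shown here:

```python
import os

# Confirm the key from .env (or the shell environment) is visible to Python.
api_key = os.environ.get("GROQ_API_KEY", "")
if not api_key:
    print("GROQ_API_KEY is not set; copy .env.example to .env and add your key")
```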
## Project Structure

```
RGB/
├── data/                   # Dataset files (downloaded)
│   ├── en_refine.json      # Noise robustness & negative rejection
│   ├── en_int.json         # Information integration
│   └── en_fact.json        # Counterfactual robustness
├── results/                # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py           # Configuration settings
│   ├── llm_client.py       # Groq LLM client
│   ├── data_loader.py      # RGB dataset loader
│   ├── evaluator.py        # Evaluation metrics
│   ├── prompts.py          # Prompt templates
│   └── pipeline.py         # Main evaluation pipeline
├── download_datasets.py    # Dataset downloader
├── run_evaluation.py       # Main entry point
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
└── README.md               # This file
```
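For orientation, the dataset files are JSON Lines: one record per line. The sketch below shows a hypothetical record; the field names (`query`, `answer`, `positive`, `negative`) are assumptions based on the upstream RGB repository, not a guaranteed schema:

```python
import json

# Hypothetical RGB-style record (one JSON object per line in en_refine.json).
line = json.dumps({
    "id": 0,
    "query": "What year did the example event happen?",
    "answer": "2021",
    "positive": ["Doc stating the event happened in 2021."],
    "negative": ["Unrelated distractor document."],
})

record = json.loads(line)
question, gold = record["query"], record["answer"]
```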
## Usage

### Run Full Evaluation

```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```
### Command Line Options

| Option | Description |
|---|---|
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
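An `argparse` parser covering these options might look like the sketch below; the defaults shown are assumptions, not necessarily what `run_evaluation.py` actually uses:

```python
import argparse

def build_parser():
    # Mirrors the documented command-line options.
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data",
                   help="Directory containing RGB datasets")
    p.add_argument("-o", "--output-dir", default="results",
                   help="Directory to save results")
    p.add_argument("-m", "--models", nargs="+", default=None,
                   help="Space-separated list of models to evaluate")
    p.add_argument("-n", "--max-samples", type=int, default=None,
                   help="Maximum samples per task (for testing)")
    p.add_argument("-t", "--tasks", nargs="+", default=None,
                   help="Specific tasks to run")
    return p

args = build_parser().parse_args(["--max-samples", "10", "--tasks", "noise_robustness"])
```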
### Download Datasets

```bash
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```
## Evaluated Models

The project uses Groq's free LLM API with the following models (at least 3, as required):

- `llama-3.3-70b-versatile` - Llama 3.3 70B (best quality)
- `llama-3.1-8b-instant` - Llama 3.1 8B (fastest)
- `mixtral-8x7b-32768` - Mixtral 8x7B (good balance)

Additional available model:

- `gemma2-9b-it` - Google's Gemma 2 9B
## Metrics

### Noise Robustness

- Accuracy: Percentage of questions answered correctly despite noisy documents
- Breakdown by noise level (0-4 noise documents)

### Negative Rejection

- Rejection Rate: Percentage of questions correctly rejected when no answer exists
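A rejection check of this kind is typically a phrase match on the model's response. The sketch below assumes a marker phrase; the exact phrase and matching logic used by `src/evaluator.py` are not shown here:

```python
# Assumed refusal marker; the pipeline's actual phrase may differ.
REJECT_PHRASE = "insufficient information"

def rejection_rate(responses):
    """Fraction of responses that contain the refusal marker."""
    rejected = sum(REJECT_PHRASE in r.lower() for r in responses)
    return rejected / len(responses)

rate = rejection_rate([
    "I cannot answer because of insufficient information in the documents.",
    "The answer is 42.",
])
```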
### Information Integration

- Accuracy: Percentage of correctly answered questions requiring information from multiple documents

### Counterfactual Robustness

- Error Detection Rate: Percentage of factual errors detected
- Error Correction Rate: Percentage of errors correctly corrected
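One way to score these two rates is: a response "detects" an error if it flags the documents as factually wrong, and "corrects" it if it also states the true answer. This is a hypothetical scheme with an assumed marker phrase, not necessarily what the evaluator implements:

```python
# Assumed detection marker; the pipeline's actual phrase may differ.
DETECT_PHRASE = "factual errors"

def counterfactual_scores(responses, gold_answers):
    """Return (error detection rate, error correction rate)."""
    detected = [DETECT_PHRASE in r.lower() for r in responses]
    corrected = [d and g.lower() in r.lower()
                 for d, r, g in zip(detected, responses, gold_answers)]
    n = len(responses)
    return sum(detected) / n, sum(corrected) / n

det, corr = counterfactual_scores(
    ["There are factual errors in the documents; the correct year is 2019.",
     "The year is 2021."],
    ["2019", "2019"],
)
```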
## Output

Results are saved in the `results/` directory:

- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format
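Producing a timestamped summary of this shape can be sketched as follows; the column names here are assumptions, not the pipeline's actual CSV schema:

```python
import csv
import io
from datetime import datetime

# Build the timestamped filename in the documented pattern.
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"summary_{stamp}.csv"

# Write one row per (model, task) result; columns are illustrative.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["model", "task", "score"])
writer.writeheader()
writer.writerow({"model": "llama-3.1-8b-instant",
                 "task": "noise_robustness", "score": 0.855})
```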
Example output:

```
RGB RAG EVALUATION RESULTS

--- NOISE ROBUSTNESS ---
Model                     Accuracy   Noise Level Breakdown
llama-3.3-70b-versatile   85.50%     N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                     Rejection Rate   Samples
llama-3.3-70b-versatile   72.50%           100

--- INFORMATION INTEGRATION ---
Model                     Accuracy   Correct/Total
llama-3.3-70b-versatile   78.00%     78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                     Error Det.   Error Corr.
llama-3.3-70b-versatile   65.00%       52.00%
```
## References
- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)
## License
This project is for educational purposes as part of a capstone project.