---
title: RGB RAG Evaluation Dashboard
emoji: 📊
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---

# RGB RAG Evaluation Project

A Python project for evaluating the Retrieval-Augmented Generation (RAG) abilities of large language models on the RGB benchmark dataset, using Groq's free LLM API.

## Project Overview

This project evaluates the four key RAG abilities defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):

1. **Noise Robustness**: ability to handle noisy/irrelevant documents
2. **Negative Rejection**: ability to decline to answer when the documents don't contain the answer
3. **Information Integration**: ability to combine information from multiple documents
4. **Counterfactual Robustness**: ability to detect and correct factual errors in documents

## Requirements

- Python 3.9+
- Groq API key (free at https://console.groq.com/)

## Installation

1. **Clone and set up the environment:**

   ```bash
   cd d:\CapStoneProject\RGB
   python -m venv .venv
   .venv\Scripts\activate       # Windows
   # source .venv/bin/activate  # Linux/Mac
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up the Groq API key:**

   ```bash
   # Copy the example env file
   copy .env.example .env

   # Edit .env and add your Groq API key
   # Get a free key at: https://console.groq.com/
   ```

4. **Download the datasets:**

   ```bash
   python download_datasets.py
   ```

## Project Structure

```
RGB/
├── data/                    # Dataset files (downloaded)
│   ├── en_refine.json       # Noise robustness & negative rejection
│   ├── en_int.json          # Information integration
│   └── en_fact.json         # Counterfactual robustness
├── results/                 # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py            # Configuration settings
│   ├── llm_client.py        # Groq LLM client
│   ├── data_loader.py       # RGB dataset loader
│   ├── evaluator.py         # Evaluation metrics
│   ├── prompts.py           # Prompt templates
│   └── pipeline.py          # Main evaluation pipeline
├── download_datasets.py     # Dataset downloader
├── run_evaluation.py        # Main entry point
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variables template
└── README.md                # This file
```

## Usage

### Run Full Evaluation

```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Limit the number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```

### Command Line Options

| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing the RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |

### Download Datasets

```bash
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```

## Evaluated Models

The project uses Groq's free LLM API with the following models (at least three, as required):

1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)

Additional available models:

- `gemma2-9b-it` - Google's Gemma 2 9B

## Metrics

### Noise Robustness

- **Accuracy**: percentage of questions answered correctly despite noisy documents
- Breakdown by noise level (0-4 noise documents)

### Negative Rejection

- **Rejection Rate**: percentage of questions correctly rejected when no answer exists

### Information Integration

- **Accuracy**: percentage of questions answered correctly when the answer requires information from multiple documents

### Counterfactual Robustness

- **Error Detection Rate**: percentage of factual errors detected
- **Error Correction Rate**: percentage of detected errors correctly fixed

## Output

Results are saved in the `results/` directory:

- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format

Example output:

```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================

--- NOISE ROBUSTNESS ---
Model                        Accuracy    Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile      85.50%      N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                        Rejection Rate    Samples
------------------------------------------------------------
llama-3.3-70b-versatile      72.50%            100

--- INFORMATION INTEGRATION ---
Model                        Accuracy    Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile      78.00%      78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                        Error Det.    Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile      65.00%        52.00%
```

## References

- Research paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)

## License

This project is for educational purposes as part of a capstone project.
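
## Appendix: Metric Sketch

To make the Rejection Rate metric concrete, here is a minimal, self-contained sketch of how it could be computed. The refusal phrases and function names below are hypothetical illustrations, not the project's actual implementation (which lives in `src/evaluator.py`):

```python
# Minimal sketch of the Rejection Rate metric: the percentage of
# unanswerable questions for which the model correctly declined to answer.
# REJECTION_PHRASES and both function names are illustrative only.

REJECTION_PHRASES = (
    "cannot answer",
    "insufficient information",
)

def is_rejection(answer: str) -> bool:
    """Treat an answer as a rejection if it contains a known refusal phrase."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in REJECTION_PHRASES)

def rejection_rate(answers: list[str]) -> float:
    """Percentage of answers (to unanswerable questions) that were rejections."""
    if not answers:
        return 0.0
    rejected = sum(is_rejection(a) for a in answers)
    return 100.0 * rejected / len(answers)

# Example: two of four answers refuse, so the rejection rate is 50%.
answers = [
    "I cannot answer based on the provided documents.",
    "The capital is Paris.",
    "There is insufficient information to answer.",
    "The answer is 42.",
]
print(f"{rejection_rate(answers):.2f}%")  # 50.00%
```

The accuracy metrics for noise robustness and information integration follow the same shape: score each sample 0 or 1, then average over the task's samples.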