| title: RGB RAG Evaluation Dashboard | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: streamlit | |
| sdk_version: "1.28.0" | |
| app_file: app.py | |
| pinned: false | |
| # RGB RAG Evaluation Project | |
| A Python project that evaluates how well LLMs perform Retrieval-Augmented Generation (RAG), using the RGB benchmark dataset and Groq's free LLM API. | |
| ## Project Overview | |
| This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431): | |
| 1. **Noise Robustness**: Ability to handle noisy/irrelevant documents | |
| 2. **Negative Rejection**: Ability to reject answering when documents don't contain the answer | |
| 3. **Information Integration**: Ability to combine information from multiple documents | |
| 4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents | |
| ## Requirements | |
| - Python 3.9+ | |
| - Groq API Key (free at https://console.groq.com/) | |
| ## Installation | |
| 1. **Clone and set up the environment:** | |
| ```bash | |
| cd d:\CapStoneProject\RGB | |
| python -m venv .venv | |
| .venv\Scripts\activate # Windows | |
| # source .venv/bin/activate # Linux/Mac | |
| ``` | |
| 2. **Install dependencies:** | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 3. **Set up Groq API Key:** | |
| ```bash | |
| # Copy the example env file | |
| copy .env.example .env # Windows | |
| # cp .env.example .env # Linux/Mac | |
| # Edit .env and add your Groq API key | |
| # Get a free key at: https://console.groq.com/ | |
| ``` | |
| 4. **Download datasets:** | |
| ```bash | |
| python download_datasets.py | |
| ``` | |
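For step 3 above, the key stored in `.env` has to reach the Python process somehow. The following is a minimal sketch of one way to do that with only the standard library. The variable name `GROQ_API_KEY` and both helper functions are assumptions for illustration; check `.env.example` and `src/config.py` for what the project actually uses.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from an env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_groq_api_key(env_path: str = ".env") -> str:
    """Return the key from the environment, falling back to the env file."""
    # GROQ_API_KEY is an assumed name, not confirmed by this README.
    key = os.environ.get("GROQ_API_KEY") or load_env_file(env_path).get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment.")
    return key
```

Failing early with a clear error is usually friendlier than letting the first API call fail with an authentication error halfway through an evaluation run.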
| ## Project Structure | |
| ``` | |
| RGB/ | |
| ├── data/ # Dataset files (downloaded) | |
| │   ├── en_refine.json # Noise robustness & negative rejection | |
| │   ├── en_int.json # Information integration | |
| │   └── en_fact.json # Counterfactual robustness | |
| ├── results/ # Evaluation results | |
| ├── src/ | |
| │   ├── __init__.py | |
| │   ├── config.py # Configuration settings | |
| │   ├── llm_client.py # Groq LLM client | |
| │   ├── data_loader.py # RGB dataset loader | |
| │   ├── evaluator.py # Evaluation metrics | |
| │   ├── prompts.py # Prompt templates | |
| │   └── pipeline.py # Main evaluation pipeline | |
| ├── download_datasets.py # Dataset downloader | |
| ├── run_evaluation.py # Main entry point | |
| ├── requirements.txt # Python dependencies | |
| ├── .env.example # Environment variables template | |
| └── README.md # This file | |
| ``` | |
| ## Usage | |
| ### Run Full Evaluation | |
| ```bash | |
| # Run with default settings (3 models, all tasks) | |
| python run_evaluation.py | |
| # Specify number of samples (for quick testing) | |
| python run_evaluation.py --max-samples 10 | |
| # Run specific tasks only | |
| python run_evaluation.py --tasks noise_robustness negative_rejection | |
| # Use specific models | |
| python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768 | |
| ``` | |
| ### Command Line Options | |
| | Option | Description | | |
| |--------|-------------| | |
| | `-d, --data-dir` | Directory containing RGB datasets (default: `data`) | | |
| | `-o, --output-dir` | Directory to save results (default: `results`) | | |
| | `-m, --models` | Space-separated list of models to evaluate | | |
| | `-n, --max-samples` | Maximum samples per task (for testing) | | |
| | `-t, --tasks` | Specific tasks to run | | |
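The options in the table above map naturally onto `argparse`. The sketch below shows one way such a parser could be declared; it is not the actual contents of `run_evaluation.py`. The task names `information_integration` and `counterfactual_robustness` are assumed by analogy with the two names used in the usage examples.

```python
import argparse

# The first two names appear in the usage examples; the last two are assumed.
TASKS = [
    "noise_robustness",
    "negative_rejection",
    "information_integration",
    "counterfactual_robustness",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(description="RGB RAG evaluation (illustrative sketch)")
    parser.add_argument("-d", "--data-dir", default="data",
                        help="Directory containing RGB datasets")
    parser.add_argument("-o", "--output-dir", default="results",
                        help="Directory to save results")
    parser.add_argument("-m", "--models", nargs="+", default=None,
                        help="Space-separated list of models to evaluate")
    parser.add_argument("-n", "--max-samples", type=int, default=None,
                        help="Maximum samples per task (for testing)")
    parser.add_argument("-t", "--tasks", nargs="+", choices=TASKS, default=None,
                        help="Specific tasks to run")
    return parser
```

Using `nargs="+"` is what allows several models or tasks to be passed after a single flag, as in `--tasks noise_robustness negative_rejection`.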
| ### Download Datasets | |
| ```bash | |
| # Download all datasets | |
| python download_datasets.py | |
| # Force re-download | |
| python download_datasets.py --force | |
| # Verify existing datasets | |
| python download_datasets.py --verify | |
| ``` | |
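As a rough idea of what a `--verify` style check might involve, the sketch below reports which of the three dataset files listed under Project Structure are missing or empty. The function name and the "non-empty file" criterion are assumptions; the real `download_datasets.py` may check more than this.

```python
from pathlib import Path

# File names taken from the Project Structure section.
EXPECTED_FILES = ["en_refine.json", "en_int.json", "en_fact.json"]


def find_missing_datasets(data_dir: str = "data") -> list:
    """Return the expected dataset files that are absent or empty."""
    root = Path(data_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # Treat a zero-byte file as missing, e.g. after an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```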
| ## Evaluated Models | |
| The project evaluates the following three models through Groq's free LLM API (at least three are required): | |
| 1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality) | |
| 2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest) | |
| 3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance) | |
| Additional available models: | |
| - `gemma2-9b-it` - Google's Gemma 2 9B | |
| ## Metrics | |
| ### Noise Robustness | |
| - **Accuracy**: Percentage of correctly answered questions with noisy documents | |
| - Breakdown by noise level (0-4 noise documents) | |
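To make the accuracy and per-level breakdown concrete, here is a minimal sketch that aggregates per-question outcomes. The input shape, a sequence of `(noise_level, is_correct)` pairs, is an assumption for illustration and not necessarily how `src/evaluator.py` stores results.

```python
from collections import defaultdict


def accuracy_by_noise_level(records):
    """Return (overall accuracy %, {noise_level: accuracy %}).

    `records` is an iterable of (noise_level, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    n = sum(totals.values())
    overall = 100.0 * sum(correct.values()) / n if n else 0.0
    breakdown = {
        level: 100.0 * correct[level] / totals[level] for level in sorted(totals)
    }
    return overall, breakdown
```

For example, two correct answers at noise level 0 plus one correct and one wrong answer at level 1 give an overall accuracy of 75% with a breakdown of 100% and 50%.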
| ### Negative Rejection | |
| - **Rejection Rate**: Percentage of questions correctly rejected when no answer exists | |
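One simple way to compute a rejection rate is to look for refusal phrases in each model response. The sketch below does that with plain substring matching; the phrase list is an assumption, and the project may detect rejections differently (for example with a fixed rejection sentence defined in `src/prompts.py`).

```python
# Assumed refusal markers; adjust to match the prompt's rejection instruction.
REJECTION_PHRASES = ("insufficient information", "cannot answer", "can not answer")


def is_rejection(response: str) -> bool:
    """Return True if the response contains any known refusal phrase."""
    text = response.lower()
    return any(phrase in text for phrase in REJECTION_PHRASES)


def rejection_rate(responses) -> float:
    """Percentage of responses that decline to answer.

    Assumes every response belongs to a question whose documents
    do not contain the answer.
    """
    responses = list(responses)
    if not responses:
        return 0.0
    return 100.0 * sum(is_rejection(r) for r in responses) / len(responses)
```

Substring matching is cheap but brittle: a model that refuses in different words is counted as having answered, so the phrase list should track the wording requested in the prompt.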
| ### Information Integration | |
| - **Accuracy**: Percentage of correctly answered questions that require information from multiple documents | |
| ### Counterfactual Robustness | |
| - **Error Detection Rate**: Percentage of factual errors detected | |
| - **Error Correction Rate**: Percentage of errors correctly corrected | |
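The two counterfactual metrics can be sketched as simple ratios over per-sample flags. This example assumes each sample is a `(detected, corrected)` pair and that both rates use the total number of samples as the denominator; the project may instead compute the correction rate only over detected errors.

```python
def counterfactual_rates(records):
    """Return (error detection %, error correction %).

    `records` is an iterable of (detected, corrected) boolean pairs.
    Both rates are taken over all samples (an assumption).
    """
    records = list(records)
    if not records:
        return 0.0, 0.0
    n = len(records)
    detected = sum(1 for was_detected, _ in records if was_detected)
    corrected = sum(1 for _, was_corrected in records if was_corrected)
    return 100.0 * detected / n, 100.0 * corrected / n
```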
| ## Output | |
| Results are saved in the `results/` directory: | |
| - `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format | |
| - `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format | |
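The `YYYYMMDD_HHMMSS` pattern in these file names corresponds to the `strftime` format `%Y%m%d_%H%M%S`. A minimal sketch of building both paths from one shared timestamp follows; the function name `result_paths` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def result_paths(output_dir: str = "results", now: datetime = None):
    """Return (json_path, csv_path) sharing one timestamp."""
    # Compute the stamp once so both files from a run sort together.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    root = Path(output_dir)
    return root / f"results_{stamp}.json", root / f"summary_{stamp}.csv"
```

Passing `now` explicitly keeps the function easy to test and guarantees that the JSON and CSV files from a single run carry identical timestamps.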
| Example output: | |
| ``` | |
| ================================================================================ | |
| RGB RAG EVALUATION RESULTS | |
| ================================================================================ | |
| --- NOISE ROBUSTNESS --- | |
| Model Accuracy Noise Level Breakdown | |
| ---------------------------------------------------------------------- | |
| llama-3.3-70b-versatile 85.50% N0:92.0% | N1:88.0% | N2:84.0% | |
| --- NEGATIVE REJECTION --- | |
| Model Rejection Rate Samples | |
| ------------------------------------------------------------ | |
| llama-3.3-70b-versatile 72.50% 100 | |
| --- INFORMATION INTEGRATION --- | |
| Model Accuracy Correct/Total | |
| ------------------------------------------------------------ | |
| llama-3.3-70b-versatile 78.00% 78/100 | |
| --- COUNTERFACTUAL ROBUSTNESS --- | |
| Model Error Det. Error Corr. | |
| ------------------------------------------------------------ | |
| llama-3.3-70b-versatile 65.00% 52.00% | |
| ``` | |
| ## References | |
| - Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431) | |
| - RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB) | |
| - Groq API: [https://console.groq.com/](https://console.groq.com/) | |
| ## License | |
| This project is for educational purposes as part of a capstone project. | |