---
title: RGB RAG Evaluation Dashboard
emoji: 📊
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
---

# RGB RAG Evaluation Project

A Python project for evaluating LLM abilities for Retrieval-Augmented Generation (RAG) using the RGB benchmark dataset with Groq's free LLM API.

## Project Overview

This project evaluates four key RAG abilities, as defined in the paper *Benchmarking Large Language Models in Retrieval-Augmented Generation*:

1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to decline to answer when the documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
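
For illustration, the noise-robustness setup can be sketched as mixing a question's relevant documents with a controlled number of irrelevant ones. The helper below is a hypothetical sketch, not code from this project:

```python
import random

def build_noise_case(positive_docs, negative_docs, noise_count, seed=0):
    """Mix relevant (positive) docs with `noise_count` irrelevant (negative)
    docs, shuffled, to test noise robustness at a given noise level."""
    rng = random.Random(seed)  # fixed seed keeps test cases reproducible
    noise = rng.sample(negative_docs, k=min(noise_count, len(negative_docs)))
    context = list(positive_docs) + noise
    rng.shuffle(context)
    return context

docs = build_noise_case(["relevant A"], ["noise 1", "noise 2", "noise 3"],
                        noise_count=2)
print(len(docs))  # 3 documents: 1 relevant + 2 noise
```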

## Requirements

Python dependencies are listed in `requirements.txt`; you also need a free Groq API key (see step 3 below).

## Installation

1. Clone the repository and set up a virtual environment:

   ```shell
   cd d:\CapStoneProject\RGB
   python -m venv .venv
   .venv\Scripts\activate  # Windows
   # source .venv/bin/activate  # Linux/Mac
   ```

2. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Set up your Groq API key:

   ```shell
   # Copy the example env file
   copy .env.example .env

   # Edit .env and add your Groq API key
   # Get a free key at: https://console.groq.com/
   ```

4. Download the datasets:

   ```shell
   python download_datasets.py
   ```

## Project Structure

```
RGB/
├── data/                      # Dataset files (downloaded)
│   ├── en_refine.json         # Noise robustness & negative rejection
│   ├── en_int.json            # Information integration
│   └── en_fact.json           # Counterfactual robustness
├── results/                   # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py              # Configuration settings
│   ├── llm_client.py          # Groq LLM client
│   ├── data_loader.py         # RGB dataset loader
│   ├── evaluator.py           # Evaluation metrics
│   ├── prompts.py             # Prompt templates
│   └── pipeline.py            # Main evaluation pipeline
├── download_datasets.py       # Dataset downloader
├── run_evaluation.py          # Main entry point
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
└── README.md                  # This file
```

## Usage

### Run Full Evaluation

```shell
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```

### Command Line Options

| Option | Description |
| --- | --- |
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
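
The options above could be wired up with `argparse` roughly as follows; this is a sketch of the documented interface, not the actual parser in `run_evaluation.py`:

```python
import argparse

def build_parser():
    """CLI mirroring the documented run_evaluation.py options."""
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data")
    p.add_argument("-o", "--output-dir", default="results")
    p.add_argument("-m", "--models", nargs="+")      # space-separated list
    p.add_argument("-n", "--max-samples", type=int)  # for quick testing
    p.add_argument("-t", "--tasks", nargs="+")
    return p

args = build_parser().parse_args(["--max-samples", "10",
                                  "--tasks", "noise_robustness"])
print(args.max_samples, args.tasks)  # 10 ['noise_robustness']
```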

### Download Datasets

```shell
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```

## Evaluated Models

The project uses Groq's free LLM API with the following models (at least three, as the project requires):

1. `llama-3.3-70b-versatile` - Llama 3.3 70B (best quality)
2. `llama-3.1-8b-instant` - Llama 3.1 8B (fastest)
3. `mixtral-8x7b-32768` - Mixtral 8x7B (good balance)

Additional available models:

- `gemma2-9b-it` - Google's Gemma 2 9B
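
A plausible shape for the model list in `src/config.py` (hypothetical; the file itself is not shown here):

```python
# Default models evaluated per run (identifiers as served by Groq).
DEFAULT_MODELS = [
    "llama-3.3-70b-versatile",  # best quality
    "llama-3.1-8b-instant",     # fastest
    "mixtral-8x7b-32768",       # good balance
]

# Optional extra models that can be passed via --models.
EXTRA_MODELS = ["gemma2-9b-it"]
```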

## Metrics

### Noise Robustness

- **Accuracy**: Percentage of questions answered correctly despite noisy documents
- Breakdown by noise level (0-4 noise documents)
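
Computing the overall accuracy together with the per-noise-level breakdown can be sketched as follows (a hypothetical helper, not the project's `evaluator.py`):

```python
from collections import defaultdict

def noise_accuracy(results):
    """results: list of (noise_level, is_correct) pairs.
    Returns overall accuracy plus a per-noise-level breakdown."""
    per_level = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    for level, ok in results:
        per_level[level][0] += int(ok)
        per_level[level][1] += 1
    overall = (sum(c for c, _ in per_level.values())
               / sum(t for _, t in per_level.values()))
    breakdown = {lvl: c / t for lvl, (c, t) in sorted(per_level.items())}
    return overall, breakdown

acc, by_level = noise_accuracy([(0, True), (0, True), (2, True), (2, False)])
print(acc, by_level)  # 0.75 {0: 1.0, 2: 0.5}
```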

### Negative Rejection

- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists in the documents
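
Rejection is commonly judged by scanning the model's response for refusal phrasing. The marker list below is purely illustrative; whatever phrases the project actually matches on would live in its evaluator code:

```python
# Illustrative refusal phrases (assumption, not the project's actual list).
REJECTION_MARKERS = [
    "cannot be answered",
    "insufficient information",
    "not found in the document",
]

def is_rejection(response: str) -> bool:
    """True if the model declined to answer, judged by phrase matching."""
    text = response.lower()
    return any(marker in text for marker in REJECTION_MARKERS)

def rejection_rate(responses):
    """Fraction of responses that are rejections."""
    return sum(is_rejection(r) for r in responses) / len(responses)

print(rejection_rate(["The question cannot be answered from the documents.",
                      "Paris"]))  # 0.5
```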

### Information Integration

- **Accuracy**: Percentage of correctly answered questions that require information from multiple documents

### Counterfactual Robustness

- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of factual errors corrected to the right answer
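
The two counterfactual rates reduce to simple ratios over per-sample flags; a hypothetical sketch (field names `detected`/`corrected` are assumptions for illustration):

```python
def counterfactual_rates(samples):
    """samples: list of dicts with booleans 'detected' (model flagged the
    factual error) and 'corrected' (model gave the true answer).
    Returns (detection_rate, correction_rate)."""
    n = len(samples)
    detection = sum(s["detected"] for s in samples) / n
    correction = sum(s["corrected"] for s in samples) / n
    return detection, correction

d, c = counterfactual_rates([
    {"detected": True, "corrected": True},
    {"detected": True, "corrected": False},
    {"detected": False, "corrected": False},
])
print(d, c)  # detection 2/3, correction 1/3
```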

## Output

Results are saved in the `results/` directory:

- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format

Example output:

```
RGB RAG EVALUATION RESULTS

--- NOISE ROBUSTNESS ---
Model                     Accuracy   Noise Level Breakdown
llama-3.3-70b-versatile   85.50%     N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                     Rejection Rate   Samples
llama-3.3-70b-versatile   72.50%           100

--- INFORMATION INTEGRATION ---
Model                     Accuracy   Correct/Total
llama-3.3-70b-versatile   78.00%     78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                     Error Det.   Error Corr.
llama-3.3-70b-versatile   65.00%       52.00%
```

## References

- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)

## License

This project is for educational purposes as part of a capstone project.