---
title: RGB RAG Evaluation Dashboard
emoji: 📊
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
---

# RGB RAG Evaluation Project

A Python project for evaluating LLM abilities for Retrieval-Augmented Generation (RAG) using the RGB benchmark dataset with Groq's free LLM API.

## Project Overview

This project evaluates four key RAG abilities, as defined in the paper *Benchmarking Large Language Models in Retrieval-Augmented Generation*:

1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to decline to answer when the documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
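
For illustration, the noise-robustness setup can be sketched as mixing a question's relevant documents with a controlled number of irrelevant ones. The helper below is a hypothetical sketch, not code from this project:

```python
import random

def build_noise_case(positive_docs, negative_docs, noise_count, seed=0):
    """Mix relevant (positive) docs with `noise_count` irrelevant (negative)
    docs, shuffled, to test noise robustness at a given noise level."""
    rng = random.Random(seed)  # fixed seed keeps test cases reproducible
    noise = rng.sample(negative_docs, k=min(noise_count, len(negative_docs)))
    context = list(positive_docs) + noise
    rng.shuffle(context)
    return context

docs = build_noise_case(["relevant A"], ["noise 1", "noise 2", "noise 3"],
                        noise_count=2)
print(len(docs))  # 3 documents: 1 relevant + 2 noise
```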

## Requirements

Python dependencies are listed in `requirements.txt`; you also need a free Groq API key (see step 3 below).

## Installation

1. Clone the repository and set up a virtual environment:

   ```shell
   cd d:\CapStoneProject\RGB
   python -m venv .venv
   .venv\Scripts\activate  # Windows
   # source .venv/bin/activate  # Linux/Mac
   ```

2. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

3. Set up your Groq API key:

   ```shell
   # Copy the example env file
   copy .env.example .env

   # Edit .env and add your Groq API key
   # Get a free key at: https://console.groq.com/
   ```

4. Download the datasets:

   ```shell
   python download_datasets.py
   ```

## Project Structure

```
RGB/
├── data/                      # Dataset files (downloaded)
│   ├── en_refine.json         # Noise robustness & negative rejection
│   ├── en_int.json            # Information integration
│   └── en_fact.json           # Counterfactual robustness
├── results/                   # Evaluation results
├── src/
│   ├── __init__.py
│   ├── config.py              # Configuration settings
│   ├── llm_client.py          # Groq LLM client
│   ├── data_loader.py         # RGB dataset loader
│   ├── evaluator.py           # Evaluation metrics
│   ├── prompts.py             # Prompt templates
│   └── pipeline.py            # Main evaluation pipeline
├── download_datasets.py       # Dataset downloader
├── run_evaluation.py          # Main entry point
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variables template
└── README.md                  # This file
```

## Usage

### Run Full Evaluation

```shell
# Run with default settings (3 models, all tasks)
python run_evaluation.py

# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10

# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection

# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```

### Command Line Options

| Option | Description |
| --- | --- |
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
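
The options above could be wired up with `argparse` roughly as follows; this is a sketch of the documented interface, not the actual parser in `run_evaluation.py`:

```python
import argparse

def build_parser():
    """CLI mirroring the documented run_evaluation.py options."""
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data")
    p.add_argument("-o", "--output-dir", default="results")
    p.add_argument("-m", "--models", nargs="+")      # space-separated list
    p.add_argument("-n", "--max-samples", type=int)  # for quick testing
    p.add_argument("-t", "--tasks", nargs="+")
    return p

args = build_parser().parse_args(["--max-samples", "10",
                                  "--tasks", "noise_robustness"])
print(args.max_samples, args.tasks)  # 10 ['noise_robustness']
```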

### Download Datasets

```shell
# Download all datasets
python download_datasets.py

# Force re-download
python download_datasets.py --force

# Verify existing datasets
python download_datasets.py --verify
```

## Evaluated Models

The project uses Groq's free LLM API with the following models (at least three, as the project requires):

1. `llama-3.3-70b-versatile` - Llama 3.3 70B (best quality)
2. `llama-3.1-8b-instant` - Llama 3.1 8B (fastest)
3. `mixtral-8x7b-32768` - Mixtral 8x7B (good balance)

Additional available models:

- `gemma2-9b-it` - Google's Gemma 2 9B
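
A plausible shape for the model list in `src/config.py` (hypothetical; the file itself is not shown here):

```python
# Default models evaluated per run (identifiers as served by Groq).
DEFAULT_MODELS = [
    "llama-3.3-70b-versatile",  # best quality
    "llama-3.1-8b-instant",     # fastest
    "mixtral-8x7b-32768",       # good balance
]

# Optional extra models that can be passed via --models.
EXTRA_MODELS = ["gemma2-9b-it"]
```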

## Metrics

### Noise Robustness

- **Accuracy**: Percentage of questions answered correctly despite noisy documents
- Breakdown by noise level (0-4 noise documents)
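
Computing the overall accuracy together with the per-noise-level breakdown can be sketched as follows (a hypothetical helper, not the project's `evaluator.py`):

```python
from collections import defaultdict

def noise_accuracy(results):
    """results: list of (noise_level, is_correct) pairs.
    Returns overall accuracy plus a per-noise-level breakdown."""
    per_level = defaultdict(lambda: [0, 0])  # level -> [correct, total]
    for level, ok in results:
        per_level[level][0] += int(ok)
        per_level[level][1] += 1
    overall = (sum(c for c, _ in per_level.values())
               / sum(t for _, t in per_level.values()))
    breakdown = {lvl: c / t for lvl, (c, t) in sorted(per_level.items())}
    return overall, breakdown

acc, by_level = noise_accuracy([(0, True), (0, True), (2, True), (2, False)])
print(acc, by_level)  # 0.75 {0: 1.0, 2: 0.5}
```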

### Negative Rejection

- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists in the documents
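
Rejection is commonly judged by scanning the model's response for refusal phrasing. The marker list below is purely illustrative; whatever phrases the project actually matches on would live in its evaluator code:

```python
# Illustrative refusal phrases (assumption, not the project's actual list).
REJECTION_MARKERS = [
    "cannot be answered",
    "insufficient information",
    "not found in the document",
]

def is_rejection(response: str) -> bool:
    """True if the model declined to answer, judged by phrase matching."""
    text = response.lower()
    return any(marker in text for marker in REJECTION_MARKERS)

def rejection_rate(responses):
    """Fraction of responses that are rejections."""
    return sum(is_rejection(r) for r in responses) / len(responses)

print(rejection_rate(["The question cannot be answered from the documents.",
                      "Paris"]))  # 0.5
```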

### Information Integration

- **Accuracy**: Percentage of correctly answered questions that require information from multiple documents

### Counterfactual Robustness

- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of factual errors corrected to the right answer
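
The two counterfactual rates reduce to simple ratios over per-sample flags; a hypothetical sketch (field names `detected`/`corrected` are assumptions for illustration):

```python
def counterfactual_rates(samples):
    """samples: list of dicts with booleans 'detected' (model flagged the
    factual error) and 'corrected' (model gave the true answer).
    Returns (detection_rate, correction_rate)."""
    n = len(samples)
    detection = sum(s["detected"] for s in samples) / n
    correction = sum(s["corrected"] for s in samples) / n
    return detection, correction

d, c = counterfactual_rates([
    {"detected": True, "corrected": True},
    {"detected": True, "corrected": False},
    {"detected": False, "corrected": False},
])
print(d, c)  # detection 2/3, correction 1/3
```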

## Output

Results are saved in the `results/` directory:

- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format

Example output:

```
RGB RAG EVALUATION RESULTS

--- NOISE ROBUSTNESS ---
Model                     Accuracy   Noise Level Breakdown
llama-3.3-70b-versatile   85.50%     N0:92.0% | N1:88.0% | N2:84.0%

--- NEGATIVE REJECTION ---
Model                     Rejection Rate   Samples
llama-3.3-70b-versatile   72.50%           100

--- INFORMATION INTEGRATION ---
Model                     Accuracy   Correct/Total
llama-3.3-70b-versatile   78.00%     78/100

--- COUNTERFACTUAL ROBUSTNESS ---
Model                     Error Det.   Error Corr.
llama-3.3-70b-versatile   65.00%       52.00%
```

## References

- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)

## License

This project is for educational purposes as part of a capstone project.