---
title: RGB RAG Evaluation Dashboard
emoji: πŸ“Š
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.28.0"
app_file: app.py
pinned: false
---
# RGB RAG Evaluation Project
A Python project for evaluating the Retrieval-Augmented Generation (RAG) abilities of large language models on the RGB benchmark dataset, using Groq's free LLM API.
## Project Overview
This project evaluates four key RAG abilities as defined in the research paper [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431):
1. **Noise Robustness**: Ability to handle noisy/irrelevant documents
2. **Negative Rejection**: Ability to reject answering when documents don't contain the answer
3. **Information Integration**: Ability to combine information from multiple documents
4. **Counterfactual Robustness**: Ability to detect and correct factual errors in documents
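To make the noise-robustness setup concrete, here is a minimal sketch of how a test case can be assembled. It assumes each RGB sample carries `positive` (relevant) and `negative` (noisy) document lists, as in the benchmark's JSON files; the helper name and exact mixing logic are illustrative, not taken from this project's source.

```python
import random

def build_context(sample: dict, noise_ratio: float, total_docs: int = 5) -> list[str]:
    """Mix relevant and noisy documents for a noise-robustness test case.

    Assumes the sample has 'positive' (relevant) and 'negative' (noisy)
    document lists, as in the RGB JSON files.
    """
    num_noise = int(total_docs * noise_ratio)
    docs = sample["positive"][: total_docs - num_noise] + sample["negative"][:num_noise]
    random.shuffle(docs)  # shuffle so the answer position gives nothing away
    return docs

sample = {
    "query": "Who wrote Hamlet?",
    "positive": ["Hamlet is a tragedy written by William Shakespeare."],
    "negative": ["Macbeth premiered in 1606.", "The Globe Theatre burned in 1613."],
}
docs = build_context(sample, noise_ratio=0.5, total_docs=2)  # 1 relevant + 1 noisy doc
```

Raising `noise_ratio` toward 1.0 is what stresses the model: at 1.0 no relevant document remains, which is exactly the negative-rejection setting.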
## Requirements
- Python 3.9+
- Groq API Key (free at https://console.groq.com/)
## Installation
1. **Clone and setup environment:**
```bash
cd RGB
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/Mac
```
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
3. **Set up Groq API Key:**
```bash
# Copy the example env file
copy .env.example .env    # Windows
# cp .env.example .env    # Linux/Mac
# Edit .env and add your Groq API key
# Get free key at: https://console.groq.com/
```
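For reference, loading the key from `.env` needs nothing more than a few lines of stdlib code. This is a hedged sketch of what `src/config.py` might do (the actual project may use the `python-dotenv` package instead; the `load_env` helper below is illustrative):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines, skipping blanks and '#' comments.

    Existing environment variables are not overridden (setdefault), so a key
    exported in the shell wins over the file.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("GROQ_API_KEY")  # None if neither .env nor the shell sets it
```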
4. **Download datasets:**
```bash
python download_datasets.py
```
## Project Structure
```
RGB/
β”œβ”€β”€ data/                    # Dataset files (downloaded)
β”‚   β”œβ”€β”€ en_refine.json       # Noise robustness & negative rejection
β”‚   β”œβ”€β”€ en_int.json          # Information integration
β”‚   └── en_fact.json         # Counterfactual robustness
β”œβ”€β”€ results/                 # Evaluation results
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py            # Configuration settings
β”‚   β”œβ”€β”€ llm_client.py        # Groq LLM client
β”‚   β”œβ”€β”€ data_loader.py       # RGB dataset loader
β”‚   β”œβ”€β”€ evaluator.py         # Evaluation metrics
β”‚   β”œβ”€β”€ prompts.py           # Prompt templates
β”‚   └── pipeline.py          # Main evaluation pipeline
β”œβ”€β”€ download_datasets.py     # Dataset downloader
β”œβ”€β”€ run_evaluation.py        # Main entry point
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ .env.example             # Environment variables template
└── README.md                # This file
```
## Usage
### Run Full Evaluation
```bash
# Run with default settings (3 models, all tasks)
python run_evaluation.py
# Specify number of samples (for quick testing)
python run_evaluation.py --max-samples 10
# Run specific tasks only
python run_evaluation.py --tasks noise_robustness negative_rejection
# Use specific models
python run_evaluation.py --models llama-3.3-70b-versatile mixtral-8x7b-32768
```
### Command Line Options
| Option | Description |
|--------|-------------|
| `-d, --data-dir` | Directory containing RGB datasets (default: `data`) |
| `-o, --output-dir` | Directory to save results (default: `results`) |
| `-m, --models` | Space-separated list of models to evaluate |
| `-n, --max-samples` | Maximum samples per task (for testing) |
| `-t, --tasks` | Specific tasks to run |
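The options in the table map onto a standard `argparse` setup. The following is a sketch mirroring the table, not the actual contents of `run_evaluation.py` (flag names come from the table; defaults and `nargs` choices are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the options table above."""
    p = argparse.ArgumentParser(description="RGB RAG evaluation")
    p.add_argument("-d", "--data-dir", default="data",
                   help="Directory containing RGB datasets")
    p.add_argument("-o", "--output-dir", default="results",
                   help="Directory to save results")
    p.add_argument("-m", "--models", nargs="+",
                   help="Space-separated list of models to evaluate")
    p.add_argument("-n", "--max-samples", type=int,
                   help="Maximum samples per task (for testing)")
    p.add_argument("-t", "--tasks", nargs="+",
                   help="Specific tasks to run")
    return p

args = build_parser().parse_args(["-n", "10", "-t", "noise_robustness"])
```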
### Download Datasets
```bash
# Download all datasets
python download_datasets.py
# Force re-download
python download_datasets.py --force
# Verify existing datasets
python download_datasets.py --verify
```
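The `--verify` pass can be approximated with a small stdlib check: each expected file must exist and parse as JSON lines (the RGB repo ships one JSON object per line). This sketch assumes that file layout; the real `download_datasets.py` may validate differently:

```python
import json
from pathlib import Path

EXPECTED_FILES = ["en_refine.json", "en_int.json", "en_fact.json"]

def verify_datasets(data_dir: str = "data") -> dict:
    """Return {filename: True/False} for each expected RGB dataset file.

    A file passes if it exists and every non-blank line parses as a JSON object.
    """
    status = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        try:
            with path.open(encoding="utf-8") as f:
                status[name] = all(
                    isinstance(json.loads(line), dict)
                    for line in f if line.strip()
                )
        except (OSError, json.JSONDecodeError):
            status[name] = False  # missing or malformed file
    return status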
## Evaluated Models
The project uses Groq's free LLM API and evaluates the following three models by default:
1. **llama-3.3-70b-versatile** - Llama 3.3 70B (best quality)
2. **llama-3.1-8b-instant** - Llama 3.1 8B (fastest)
3. **mixtral-8x7b-32768** - Mixtral 8x7B (good balance)
Additional available models:
- `gemma2-9b-it` - Google's Gemma 2 9B
## Metrics
### Noise Robustness
- **Accuracy**: Percentage of correctly answered questions with noisy documents
- Breakdown by noise level (0-4 noise documents)
### Negative Rejection
- **Rejection Rate**: Percentage of questions correctly rejected when no answer exists
### Information Integration
- **Accuracy**: Percentage of correctly answered questions that require combining information from multiple documents
### Counterfactual Robustness
- **Error Detection Rate**: Percentage of factual errors detected
- **Error Correction Rate**: Percentage of factual errors for which the model produced the correct answer
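The two most common metrics above reduce to simple ratio computations. Here is a hedged sketch of how `src/evaluator.py` might score them; the containment-based matching rule and the rejection marker string are assumptions (the actual rule lives in the project's prompts and evaluator):

```python
def accuracy(results: list) -> float:
    """Fraction of samples whose prediction contains the gold answer.

    Assumes each result dict has 'gold' and 'prediction' string fields and
    uses case-insensitive substring matching as the correctness rule.
    """
    if not results:
        return 0.0
    correct = sum(r["gold"].lower() in r["prediction"].lower() for r in results)
    return correct / len(results)

def rejection_rate(results: list, marker: str = "I can not answer") -> float:
    """Fraction of unanswerable cases where the model emitted the rejection marker."""
    if not results:
        return 0.0
    rejected = sum(marker.lower() in r["prediction"].lower() for r in results)
    return rejected / len(results)
```

Error detection and correction rates follow the same pattern, just with a detection marker (e.g. a phrase flagging factual errors in the documents) and the corrected answer as the targets.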
## Output
Results are saved in the `results/` directory:
- `results_YYYYMMDD_HHMMSS.json` - Full results in JSON format
- `summary_YYYYMMDD_HHMMSS.csv` - Summary table in CSV format
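Producing those two timestamped files takes only the stdlib. This is a sketch under the naming scheme above; the `save_results` helper and its argument shapes are assumptions, not the project's actual API:

```python
import csv
import json
from datetime import datetime
from pathlib import Path

def save_results(results: dict, summary_rows: list, output_dir: str = "results"):
    """Write full results as JSON and a summary table as CSV.

    File names follow the results_YYYYMMDD_HHMMSS.json /
    summary_YYYYMMDD_HHMMSS.csv pattern. summary_rows is a list of dicts
    sharing the same keys (one row per model/task).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    json_path = out / f"results_{stamp}.json"
    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    csv_path = out / f"summary_{stamp}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary_rows[0].keys()))
        writer.writeheader()
        writer.writerows(summary_rows)
    return json_path, csv_path
```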
Example output:
```
================================================================================
RGB RAG EVALUATION RESULTS
================================================================================
--- NOISE ROBUSTNESS ---
Model Accuracy Noise Level Breakdown
----------------------------------------------------------------------
llama-3.3-70b-versatile 85.50% N0:92.0% | N1:88.0% | N2:84.0%
--- NEGATIVE REJECTION ---
Model Rejection Rate Samples
------------------------------------------------------------
llama-3.3-70b-versatile 72.50% 100
--- INFORMATION INTEGRATION ---
Model Accuracy Correct/Total
------------------------------------------------------------
llama-3.3-70b-versatile 78.00% 78/100
--- COUNTERFACTUAL ROBUSTNESS ---
Model Error Det. Error Corr.
------------------------------------------------------------
llama-3.3-70b-versatile 65.00% 52.00%
```
## References
- Research Paper: [Benchmarking Large Language Models in Retrieval-Augmented Generation](https://arxiv.org/pdf/2309.01431)
- RGB Dataset: [https://github.com/chen700564/RGB](https://github.com/chen700564/RGB)
- Groq API: [https://console.groq.com/](https://console.groq.com/)
## License
This project is for educational purposes as part of a capstone project.