---
title: RAG Benchmark Leaderboard
emoji: π
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---
# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.
## Features

- **Version Comparison**: Compare model performance across different versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generation and retrieval metrics side by side (see the sketch after this list)
- **Customizable Views**: Filter and sort models based on different criteria
- **Easy Submission**: Simple API for submitting your model results
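
The radar charts draw on the same metric groups described under Data Format below. As an illustrative sketch only (this is not the app's actual plotting code; it assumes Plotly is installed and uses made-up scores), such a chart can be drawn like this:

```python
# Minimal radar-chart sketch; metric names follow the "metrics" layout
# documented in the Data Format section, values are made up for the example.
import plotly.graph_objects as go

metrics = {"hit_rate": 0.82, "mrr": 0.65, "precision": 0.78}

fig = go.Figure(
    go.Scatterpolar(
        r=list(metrics.values()),
        theta=list(metrics.keys()),
        fill="toself",
        name="Model Name",
    )
)
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 1])))
fig.show()
```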
## Installation

```bash
pip install -r requirements.txt
```
## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.
## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3}
)

# Submit results
benchmark.submit_results(results)
```
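
The same `evaluate` call accepts the other documented retriever types. For example, a hybrid retriever with a larger candidate pool (the `top_k` value here is purely illustrative):

```python
# Same evaluate()/submit_results() API as above, switching to the "hybrid"
# retriever option listed in the comment; the top_k value is an arbitrary example.
hybrid_results = benchmark.evaluate(
    model_name="Your Model Name (hybrid)",
    embedding_model="your-embedding-model",
    retriever_type="hybrid",
    retrieval_config={"top_k": 5}
)
benchmark.submit_results(hybrid_results)
```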
## Data Format

The `results.json` file has the following structure:

```json
{
  "items": {
    "1.0": {            // Dataset version
      "model1": {       // Submission ID
        "model_name": "Model Name",
        "timestamp": "2025-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": {
            "top_k": 3
          }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```
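
For quick offline inspection, the file can be flattened into per-submission rows with the standard library (a minimal sketch; the path to `results.json` is assumed and may differ in the repository):

```python
# Minimal sketch: flatten results.json into (version, submission, metrics) rows.
# The file location is an assumption; adjust the path as needed.
import json

with open("results.json") as f:
    data = json.load(f)

for version, submissions in data["items"].items():
    for submission_id, entry in submissions.items():
        retrieval = entry["metrics"]["retrieval"]
        generation = entry["metrics"]["generation"]
        print(
            f"{version} | {entry['model_name']} ({submission_id}) | "
            f"hit_rate={retrieval['hit_rate']:.2f} mrr={retrieval['mrr']:.2f} "
            f"rougeL={generation['rougeL']:.2f}"
        )
```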
## License

MIT
## Metrics Tracked

### Retrieval Metrics

- Hit Rate: proportion of queries for which at least one relevant document is retrieved
- MRR (Mean Reciprocal Rank): average reciprocal rank of the first relevant document (see the sketch below)
- Precision: proportion of retrieved documents that are relevant
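
As a rough reference for how the first two numbers fall out of a ranked retrieval run (an illustrative sketch only; the benchmark's own scoring code may differ):

```python
# Illustrative hit-rate / MRR computation over ranked retrieval results.
# ranked_relevance[i] lists, in rank order, whether each retrieved document
# for query i was relevant; the data here is made up for the example.
ranked_relevance = [
    [False, True, False],   # first relevant document at rank 2
    [True, False, False],   # first relevant document at rank 1
    [False, False, False],  # no relevant document retrieved
]

hits = sum(any(ranks) for ranks in ranked_relevance)
hit_rate = hits / len(ranked_relevance)

reciprocal_ranks = [
    1.0 / (ranks.index(True) + 1) if any(ranks) else 0.0
    for ranks in ranked_relevance
]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)

print(f"hit_rate={hit_rate:.2f}, mrr={mrr:.2f}")  # hit_rate=0.67, mrr=0.50
```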
### Generation Metrics

- ROUGE-1: unigram overlap between the generated answer and the reference
- ROUGE-2: bigram overlap between the generated answer and the reference
- ROUGE-L: longest-common-subsequence overlap between the generated answer and the reference (see the sketch below)
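
ROUGE scores of this kind can be computed with the `rouge-score` package (an assumption here; the leaderboard's evaluation pipeline may use a different implementation):

```python
# Illustrative ROUGE computation using the rouge-score package
# (pip install rouge-score); the strings are made-up examples.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The cat sat on the mat.",
    prediction="A cat was sitting on the mat.",
)
for name, score in scores.items():
    print(f"{name}: f1={score.fmeasure:.2f}")
```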