---
title: RAG Benchmark Leaderboard
emoji: πŸ“š
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---


# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.

## Features

- **Version Comparison**: Compare model performance across versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generative and retrieval metrics
- **Customizable Views**: Filter and sort models based on different criteria
- **Easy Submission**: Simple API for submitting your model results

## Installation

```bash
pip install -r requirements.txt
```

## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.
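If port 7860 is already taken, Gradio honors the standard `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` environment variables, so you can pick another host or port without touching the code:

```bash
# Bind to all interfaces on port 8080 instead of the default 127.0.0.1:7860
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=8080 python app.py
```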

## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3}
)

# Submit results
benchmark.submit_results(results)
```
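If you want to inspect the output before submitting (assuming `evaluate` returns a JSON-serializable dict shaped like the entries described under Data Format below), a quick pretty-print is enough:

```python
import json

# Pretty-print the evaluation results for a quick sanity check.
# Assumes `results` is the JSON-serializable dict from the example above.
print(json.dumps(results, indent=2))
```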

## Data Format

The `results.json` file has the following structure:

```json
{
  "items": {
    "1.0": {  // Dataset version
      "model1": {  // Submission ID
        "model_name": "Model Name",
        "timestamp": "2025-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": {
            "top_k": 3
          }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```
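For reference, here is a minimal sketch of reading this file with the standard library. The `//` annotations above are explanatory only (the actual file is plain JSON), and the `results.json` path is an assumption; adjust it to wherever the leaderboard stores its data:

```python
import json
from pathlib import Path

# Path is an assumption; adjust to the leaderboard's actual data location.
data = json.loads(Path("results.json").read_text())

# Walk dataset versions -> submissions and print one headline metric each.
for version, submissions in data["items"].items():
    for submission_id, entry in submissions.items():
        hit_rate = entry["metrics"]["retrieval"]["hit_rate"]
        print(f"v{version} {entry['model_name']} ({submission_id}): hit_rate={hit_rate:.2f}")
```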

## License

MIT

# RAG Evaluation Leaderboard

This leaderboard tracks different RAG (Retrieval-Augmented Generation) implementations and their performance metrics.

## Metrics Tracked

### Retrieval Metrics

- **Hit Rate**: Proportion of queries for which at least one relevant document appears in the retrieved results
- **MRR (Mean Reciprocal Rank)**: Average of the reciprocal rank of the first relevant document per query (see the sketch below)

### Generation Metrics

- **ROUGE-1**: Unigram overlap with the reference answer
- **ROUGE-2**: Bigram overlap with the reference answer
- **ROUGE-L**: Longest common subsequence with the reference answer
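
For concreteness, here is a minimal sketch of how the two retrieval metrics are typically computed from ranked results. The function names and data layout are illustrative, not part of the benchmark API:

```python
def hit_rate(ranked_results, relevant):
    """Fraction of queries with at least one relevant doc among the retrieved."""
    hits = sum(
        1 for docs, rel in zip(ranked_results, relevant)
        if any(d in rel for d in docs)
    )
    return hits / len(ranked_results)

def mean_reciprocal_rank(ranked_results, relevant):
    """Mean of 1/rank of the first relevant doc per query (0 if none retrieved)."""
    total = 0.0
    for docs, rel in zip(ranked_results, relevant):
        for rank, d in enumerate(docs, start=1):
            if d in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two queries, top-3 retrieved doc IDs each; gold holds the relevant IDs.
ranked = [["d1", "d7", "d3"], ["d9", "d2", "d4"]]
gold = [{"d3"}, {"d5"}]
print(hit_rate(ranked, gold))              # 0.5
print(mean_reciprocal_rank(ranked, gold))  # (1/3 + 0) / 2 β‰ˆ 0.167
```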