---
title: SearchAgent_Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: A standardized leaderboard for search agents
sdk_version: 4.23.0
tags:
  - leaderboard
---

# Overview

SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across three groups of benchmarks:

- General QA: NQ, TriviaQA, PopQA
- Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
- Novel closed-world: FictionalHot

For clarity, the table shows a minimal set of columns:

- Rank, Model, Average, per-dataset scores, Model Size (3B/7B)

# Data format (results)

Place model result files in `eval-results/` as JSON. Scores are decimals in [0, 1]; the UI multiplies them by 100 for display.

```json
{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "YourMethod-Qwen2.5-7b-Instruct",
    "model_sha": "main"
  },
  "results": {
    "nq": { "exact_match": 0.469 },
    "triviaqa": { "exact_match": 0.640 },
    "popqa": { "exact_match": 0.501 },
    "hotpotqa": { "exact_match": 0.389 },
    "2wiki": { "exact_match": 0.382 },
    "musique": { "exact_match": 0.185 },
    "bamboogle": { "exact_match": 0.392 },
    "fictionalhot": { "exact_match": 0.061 }
  }
}
```

Notes:

- `model_name` uses the format `Method-Qwen2.5-{3b|7b}-Instruct` (no org prefix required)
- Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
- Metric key: `exact_match`
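
Before submitting, it can help to check a result file against the rules above. The following is a minimal sketch, not part of the leaderboard code: `validate_result` and `NAME_RE` are hypothetical names, and the regular expression is only one possible reading of the `Method-Qwen2.5-{3b|7b}-Instruct` format.

```python
import json
import re

# Task names and metric key as listed in the notes above.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]
METRIC = "exact_match"

# Assumed pattern for `Method-Qwen2.5-{3b|7b}-Instruct`.
# The method part may itself contain hyphens.
NAME_RE = re.compile(r"^(?P<method>.+)-Qwen2\.5-(?P<size>3b|7b)-Instruct$",
                     re.IGNORECASE)


def validate_result(data: dict) -> list:
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []

    name = data.get("config", {}).get("model_name", "")
    if not NAME_RE.match(name):
        problems.append(
            f"model_name {name!r} does not match Method-Qwen2.5-{{3b|7b}}-Instruct"
        )

    results = data.get("results", {})
    for task in TASKS:
        score = results.get(task, {}).get(METRIC)
        if score is None:
            problems.append(f"missing {task}.{METRIC}")
        elif (isinstance(score, bool)
              or not isinstance(score, (int, float))
              or not 0.0 <= score <= 1.0):
            # A value such as 46.9 usually means percentages were stored
            # instead of decimals.
            problems.append(
                f"{task}.{METRIC} must be a decimal in [0, 1], got {score!r}"
            )
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '{"config": {"model_name": "YourMethod-Qwen2.5-7b-Instruct"},'
        ' "results": {"nq": {"exact_match": 46.9}}}'
    )
    for problem in validate_result(sample):
        print(problem)
```

The sample in the `__main__` block is deliberately invalid: it stores `nq` as a percentage and omits the other seven tasks, so each of those is reported.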

# Submission (via Community)

We accept submissions through the Space's Community tab (Discussions):

1) Open the Space page and go to Community: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
2) Create a discussion titled `Submission: <YourMethod>-<model_name>-<model_size>`
3) Include:
   - Model weights link (HF or GitHub)
   - Short method description
   - Evaluation JSON (inline or attached)

# Local development

Run locally (example):

```bash
python app.py
```

The app reads local data only (no remote download) from:

- Results: `./eval-results`
- (Optional) Requests: `./eval-queue` (not required for the simplified table)
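
To preview how local result files might turn into table rows, here is a rough standard-library sketch. It is not the code in `src/leaderboard/read_evals.py` or `src/populate.py`: `load_rows` is a hypothetical helper, and it assumes that Average is the unweighted mean of the eight task scores and that rows are ranked by Average in descending order.

```python
import json
from pathlib import Path

TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]


def load_rows(results_dir: str = "./eval-results") -> list:
    """Read every JSON file under results_dir and return ranked table rows."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Scores are stored in [0, 1]; scale to percentages as the UI does.
        scores = {task: data["results"][task]["exact_match"] * 100
                  for task in TASKS}
        rows.append({
            "Model": data["config"]["model_name"],
            **scores,
            # Assumption: unweighted mean over all tasks.
            "Average": sum(scores.values()) / len(TASKS),
        })
    # Assumption: higher Average ranks first.
    rows.sort(key=lambda row: row["Average"], reverse=True)
    return [{"Rank": rank, **row} for rank, row in enumerate(rows, start=1)]
```

Calling `load_rows()` from the repository root returns one dictionary per result file, which can be printed or passed to `pandas.DataFrame` for inspection.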

If dependencies are missing, install the minimal set:

```bash
pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
```

# Customize

- Tasks and page texts: `src/about.py`
- Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset, Model Size)
- Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
- Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`

Restart the app after changes.