SearchAgent_Leaderboard

Sleeping

App Files Files Community

SearchAgent_Leaderboard / README.md

shyuli

version v0.1

eae2329 4 months ago

preview code

raw

history blame contribute delete

2.6 kB

	---
	title: SearchAgent_Leaderboard
	emoji: 🥇
	colorFrom: green
	colorTo: indigo
	sdk: gradio
	app_file: app.py
	pinned: true
	license: apache-2.0
	short_description: A standardized leaderboard for search agents
	sdk_version: 4.23.0
	tags:
	- leaderboard
	---

	# Overview

	SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across:
	- General QA: NQ, TriviaQA, PopQA
	- Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
	- Novel closed-world: FictionalHot

	We display a minimal set of columns for clarity:
	- Rank, Model, Average, per-dataset scores, Model Size (3B/7B)


	# Data format (results)

	Place model result files in `eval-results/` as JSON. Scores are decimals in [0,1] (the UI multiplies by 100).

	```json
	{
	"config": {
	"model_dtype": "torch.float16",
	"model_name": "YourMethod-Qwen2.5-7b-Instruct",
	"model_sha": "main"
	},
	"results": {
	"nq": { "exact_match": 0.469 },
	"triviaqa": { "exact_match": 0.640 },
	"popqa": { "exact_match": 0.501 },
	"hotpotqa": { "exact_match": 0.389 },
	"2wiki": { "exact_match": 0.382 },
	"musique": { "exact_match": 0.185 },
	"bamboogle": { "exact_match": 0.392 },
	"fictionalhot": { "exact_match": 0.061 }
	}
	}
	```

	Notes:
	- `model_name` uses the format `Method-Qwen2.5-{3b\|7b}-Instruct` (no org prefix required)
	- Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
	- Metric key: `exact_match`


	# Submission (via Community)

	We accept submissions via the Space Community (Discussions):
	1) Open the Space page and go to Community: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
	2) Create a discussion with title `Submission: <YourMethod>-<model_name>-<model_size>`
	3) Include:
	- Model weights link (HF or GitHub)
	- Short method description
	- Evaluation JSON (inline or attached)


	# Local development

	Run locally (example):
	```bash
	python app.py
	```

	The app reads local data only (no remote download) from:
	- Results: `./eval-results`
	- (Optional) Requests: `./eval-queue` (not required for the simplified table)

	If you see missing dependencies, install minimally:
	```bash
	pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
	```


	# Customize

	- Tasks and page texts: `src/about.py`
	- Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset, Model Size)
	- Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
	- Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`

	Restart the app after changes.