	# Portable 10-Benchmark Eval Bundle

	This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

	## Included

	- `agent_eval_api/` with the top-level 10-benchmark runners and bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, `BFCL`, and local task configs
	- `AgentBench/` code/config needed for `DBBench`
	- `lm-evaluation-harness/` for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
	- `env.example.sh` with cluster-specific path placeholders

	## Not Included

	- model checkpoints or HF snapshots
	- `hf_cache/`, `LMUData/`, downloaded benchmark data
	- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, or BFCL `result/`
	- `AgentBench/data/` and other benchmark payload data

	## Directory Layout

	```
	portable_10bench_eval_bundle_20260330_0140/
	  README.md
	  env.example.sh
	  agent_eval_api/
	  AgentBench/
	  lm-evaluation-harness/
	```

	The patched scripts in this bundle use relative paths where possible.
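After unpacking on the target cluster, a quick layout check can catch a partial copy. The following is a minimal sketch: the function name `check_bundle_layout` is hypothetical and not part of the bundle, and it only tests for the top-level entries shown in the layout above.

```bash
# Hypothetical helper (not shipped with the bundle): confirm that the
# top-level entries from the directory layout exist under a bundle root.
check_bundle_layout() {
  local root=$1 missing=0 entry
  for entry in README.md env.example.sh agent_eval_api AgentBench lm-evaluation-harness; do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $root/$entry" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

For example, `check_bundle_layout ./portable_10bench_eval_bundle_20260330_0140` prints one line per missing entry and returns non-zero if any are absent.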

	## Environment Setup

	You still need working Python environments on the target cluster. A practical split is:

	- `vllm` env: model serving
	- `vsibench` env: `VSI-Bench` and current `VLMEvalKit` wrappers
	- `BFCL` env: BFCL
	- base Python env: text-benchmark wrappers and orchestration

	A simple starting point is to edit `env.example.sh` to match the new cluster, then source it from the bundle root:

	```bash
	cd portable_10bench_eval_bundle_20260330_0140
	source env.example.sh
	```
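Once the environment file has been sourced, it can help to confirm that each interpreter path actually resolves to an executable. This is a sketch only: `check_interpreters` is a hypothetical helper, and it assumes your environment file exports variables such as `VLLM_PYTHON`, `VSIBENCH_PYTHON`, and `BFCL_PYTHON`, as used elsewhere in this README.

```bash
# Hypothetical helper: report any interpreter path that is empty,
# missing, a directory, or not executable.
check_interpreters() {
  local path bad=0
  for path in "$@"; do
    if [ ! -x "$path" ] || [ -d "$path" ]; then
      echo "not an executable file: ${path:-<empty>}" >&2
      bad=1
    fi
  done
  return "$bad"
}
```

A possible call is `check_interpreters "$VLLM_PYTHON" "$VSIBENCH_PYTHON" "$BFCL_PYTHON"`. It does not verify that the required packages are installed in each environment.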

	## Data / Cache Prep

	This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

	1. Model weights or HF snapshots
	   - For local models, download them to a local path and pass that path as `--tokenizer` / `--model` inputs.
	   - For the parallel round launcher, set:
	     - `SNAPSHOT_R5=/path/to/round5_snapshot`
	     - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`

	2. Hugging Face cache for multimodal benchmarks
	   - `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
	   - Create a cache directory such as `/path/to/hf_cache`, then export:

	     ```bash
	     export FIXED_HF_CACHE=/path/to/hf_cache
	     ```

	   Typical subpaths used by the runners are:
	   - `$FIXED_HF_CACHE`
	   - `$FIXED_HF_CACHE/hub`
	   - `$FIXED_HF_CACHE/datasets`
	   - `$FIXED_HF_CACHE/LMUData` for `MMBench`

	3. Docker for DBBench
	   - `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
	   - Make sure Docker and Compose are available on the cluster node.
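The cache and Docker steps above can be sketched as two small helpers. Both function names are hypothetical and not part of the bundle. The first only creates the subdirectories listed above and does not download any data; the second assumes the Compose v2 plugin (`docker compose`), which is the invocation this README mentions.

```bash
# Hypothetical helper: create the cache subdirectories the runners
# typically use under a shared cache root.
prepare_cache_root() {
  local root=$1
  mkdir -p "$root/hub" "$root/datasets" "$root/LMUData"
}

# Hypothetical helper: succeed only if `docker compose` responds.
# Fails when Docker is missing or the Compose plugin is not installed.
have_docker_compose() {
  docker compose version >/dev/null 2>&1
}
```

For example, run `prepare_cache_root /path/to/hf_cache` before exporting `FIXED_HF_CACHE`, and `have_docker_compose || echo "docker compose not available"` on the node that will run `DBBench`.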

	## Core Entry Points

	### 1. All 10 benchmarks against one API

	```bash
	cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

	bash ./run_all_10bench_db_api.sh \
	  --api-base http://127.0.0.1:8100/v1 \
	  --model your_served_model_name \
	  --tokenizer /path/to/local/model_or_tokenizer \
	  --full \
	  --canonical-hf-home "$FIXED_HF_CACHE" \
	  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
	  --vsibench-python "$VSIBENCH_PYTHON" \
	  --vlmeval-python "$VLMEVAL_PYTHON" \
	  --bfcl-python "$BFCL_PYTHON" \
	  --base-python "$BASE_PYTHON" \
	  --aptbench-python "$APTBENCH_PYTHON" \
	  --dbbench-python "$DBBENCH_PYTHON" \
	  --output-root ./manual_runs/your_model_run/benchmarks \
	  --tag your_model_run
	```
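Because a full run takes a long time, it may be worth confirming first that the `--model` value matches a name the server reports. The sketch below assumes an OpenAI-compatible server whose `/models` response lists names under `"id"` keys; that response shape is an assumption, not something this README states. Both function names are hypothetical, and the match uses plain `grep` rather than a JSON parser.

```bash
# Reads a /models JSON response on stdin and succeeds if it contains an
# "id" field equal to the given model name. Whitespace is stripped before
# matching, so model names containing spaces are not supported.
models_json_has_model() {
  local model=$1
  tr -d '[:space:]' | grep -Fq "\"id\":\"$model\""
}

# Queries the API base (for example http://127.0.0.1:8100/v1) and checks
# whether the served model list includes the given name.
api_serves_model() {
  local api_base=$1 model=$2
  curl -fsS "$api_base/models" | models_json_has_model "$model"
}
```

A possible guard before the full run is `api_serves_model http://127.0.0.1:8100/v1 your_served_model_name || exit 1`.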

	### 2. Single text benchmark

	For `arc`, `ruler`, `hh_rlhf`, or `advbench`:

	```bash
	cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

	bash ./run_eval_task_api.sh arc \
	  --api-base http://127.0.0.1:8100/v1 \
	  --model your_served_model_name \
	  --tokenizer /path/to/local/model_or_tokenizer \
	  --lm-eval-dir ../lm-evaluation-harness \
	  --include-path ./tasks \
	  --full
	```
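To run all four text benchmarks in sequence, the single-task command can be wrapped in a loop. This is a sketch under stated assumptions: `run_text_benchmarks` and the `RUNNER` override are hypothetical names introduced here for illustration, and the flags are copied from the example above.

```bash
# Run from agent_eval_api/. RUNNER is a hypothetical override used only
# for dry runs (e.g. RUNNER=echo); it is not an option of the bundled script.
run_text_benchmarks() {
  local api_base=$1 model=$2 tokenizer=$3 task failed=0
  for task in arc ruler hh_rlhf advbench; do
    ${RUNNER:-bash ./run_eval_task_api.sh} "$task" \
      --api-base "$api_base" \
      --model "$model" \
      --tokenizer "$tokenizer" \
      --lm-eval-dir ../lm-evaluation-harness \
      --include-path ./tasks \
      --full || { echo "task failed: $task" >&2; failed=1; }
  done
  # Non-zero if any task failed; remaining tasks still run.
  return "$failed"
}
```

Calling `RUNNER=echo run_text_benchmarks http://127.0.0.1:8100/v1 your_served_model_name /path/to/local/model_or_tokenizer` prints the four argument lists without launching anything. The loop continues past a failing task so that one broken benchmark does not block the others.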

	### 3. Individual multimodal benchmarks

	- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
	- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
	- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
	- `APTBench`: `agent_eval_api/run_aptbench_api.sh`

	Example for `VideoMME` infer mode:

	```bash
	cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

	bash ./run_videomme_api.sh \
	  --api-base http://127.0.0.1:8100/v1 \
	  --model your_served_model_name \
	  --model-alias your_served_model_name \
	  --run-mode infer \
	  --api-nproc 4 \
	  --hf-home "$FIXED_HF_CACHE" \
	  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
	  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
	  --output-root ./manual_runs/videomme_your_model
	```
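If you script around these wrappers, a small lookup from benchmark name to runner script keeps the mapping in one place. The function below is a hypothetical convenience, not part of the bundle; the script names are the ones listed above, and it makes no assumption about which flags each wrapper accepts.

```bash
# Hypothetical helper: print the wrapper script (relative to
# agent_eval_api/) for a given benchmark name.
runner_for_benchmark() {
  case "$1" in
    MMBench)   echo run_mmbench_api.sh ;;
    VideoMME)  echo run_videomme_api.sh ;;
    VSI-Bench) echo run_vsibench_api.sh ;;
    APTBench)  echo run_aptbench_api.sh ;;
    *) echo "unknown benchmark: $1" >&2; return 1 ;;
  esac
}
```

For example, `bash "./$(runner_for_benchmark VideoMME)"` followed by the flags from the example above would launch the `VideoMME` wrapper.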

	### 4. Parallel round launcher

	Before running this helper, you must set:

	```bash
	export SNAPSHOT_R5=/path/to/round5_snapshot
	export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
	export FIXED_HF_CACHE=/path/to/hf_cache
	export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
	export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
	export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
	export VLLM_PYTHON=/path/to/envs/vllm/bin/python
	```

	Then run:

	```bash
	cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
	bash ./run_rounds_from_bfcl_parallel.sh
	```
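Since the launcher depends on several exported variables, a guard that fails early can save a wasted launch. The helper below is a hypothetical sketch, not part of the bundle. It uses `printenv`, so it only sees variables that were actually exported, which matches what a child `bash` process would inherit.

```bash
# Hypothetical helper: fail if any named variable is not exported or is empty.
require_exported() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "unset, unexported, or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A possible call before the launcher is `require_exported SNAPSHOT_R5 SNAPSHOT_R102025 FIXED_HF_CACHE VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON VLLM_PYTHON || exit 1`.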

	## Benchmark Notes

	- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`
	  - use `lm-evaluation-harness` plus local task configs under `agent_eval_api/tasks/`
	- `BFCL`
	  - uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
	- `MMBench`, `VideoMME`
	  - use `VLMEvalKit`; these wrappers are often used in `infer` mode for leaderboard submission workflows
	- `VSI-Bench`
	  - uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
	- `APTBench`
	  - uses `agent_eval_api/APTBench/code/`
	- `DBBench`
	  - uses the bundled `AgentBench/` code and Docker services

	## Sanity Checks on a New Cluster

	Before running a full evaluation, check:

	```bash
	cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

	bash -n run_all_10bench_db_api.sh
	bash -n run_eval_task_api.sh
	bash -n run_mmbench_api.sh
	bash -n run_videomme_api.sh
	bash -n run_vsibench_api.sh
	bash -n run_aptbench_api.sh

	curl -fsS http://127.0.0.1:8100/v1/models
	```

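	If the scripts parse and the API responds, the bundle's shell scripts are syntactically intact and the model server is reachable. Note that `bash -n` only checks syntax without running anything, so these checks do not confirm that datasets, caches, or Python environments are in place.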
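The per-script syntax checks can also be run as a loop that reports every failing file instead of stopping at the first one. This is a minimal sketch; `syntax_check_scripts` is a hypothetical helper and not part of the bundle.

```bash
# Hypothetical helper: run `bash -n` (parse only, no execution) on each
# script and report which ones fail to parse or cannot be read.
syntax_check_scripts() {
  local script bad=0
  for script in "$@"; do
    if bash -n "$script" 2>/dev/null; then
      echo "ok: $script"
    else
      echo "syntax error or unreadable: $script" >&2
      bad=1
    fi
  done
  return "$bad"
}
```

From `agent_eval_api/`, a call such as `syntax_check_scripts run_all_10bench_db_api.sh run_eval_task_api.sh run_mmbench_api.sh run_videomme_api.sh run_vsibench_api.sh run_aptbench_api.sh` covers the same scripts as the commands above.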

	## Scope

	This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

	- install the Python environments on the target cluster
	- download the models and benchmark data/cache there
	- point the scripts at the new local paths

	## Portability Notes

	- `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
	- The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle.
	- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
	- `VideoMME` and `MMBench` are commonly used in `infer` mode when you plan to upload predictions to an external leaderboard.
	- `APTBench`, `BFCL`, and `VSI-Bench` benchmark payloads are also not bundled in this package. Populate their expected data locations after you move the bundle to the new cluster.
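The two options for `HH-RLHF` and `AdvBench` data (an environment variable or a directory under the bundle root) can be checked up front with a helper like the one below. This is a sketch only: `resolve_dataset_path` and `BUNDLE_ROOT` are hypothetical names, and how the bundled scripts resolve these paths internally is not described here.

```bash
# Hypothetical helper: prefer an explicit override, otherwise fall back to
# a default location; fail if the chosen path does not exist.
resolve_dataset_path() {
  local override=$1 fallback=$2 chosen
  chosen=${override:-$fallback}
  if [ ! -e "$chosen" ]; then
    echo "dataset path not found: $chosen" >&2
    return 1
  fi
  printf '%s\n' "$chosen"
}
```

For example, `resolve_dataset_path "${HH_RLHF_DATASET_PATH:-}" "$BUNDLE_ROOT/datasets/hh_rlhf"` prints the path that would be used, or fails if neither location has been populated.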