Portable 10-Benchmark Eval Bundle
This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.
Included
- agent_eval_api/ with the top-level 10-benchmark runners and bundled code for APTBench, VLMEvalKit, thinking-in-space, BFCL, and local task configs
- AgentBench/ code and config needed for DBBench
- lm-evaluation-harness/ for ARC, RULER, HH-RLHF, and AdvBench
- env.example.sh with cluster-specific path placeholders
Not Included
- model checkpoints or HF snapshots
- hf_cache/, LMUData/, and downloaded benchmark data
- local result directories such as manual_runs/, runs/, automation*/, score/, or BFCL result/
- AgentBench/data/ and other benchmark payload data
Directory Layout
portable_10bench_eval_bundle_20260330_0140/
README.md
env.example.sh
agent_eval_api/
AgentBench/
lm-evaluation-harness/
The patched scripts in this bundle use relative paths where possible.
Environment Setup
You still need working Python environments on the target cluster. A practical split is:
- vllm env: model serving
- vsibench env: VSI-Bench and current VLMEvalKit wrappers
- BFCL env: BFCL
- base Python env: text-benchmark wrappers and orchestration
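One way to provision the dedicated interpreters is plain venvs, which env.example.sh then wires to the runner scripts. A minimal sketch, with placeholder paths; each env still needs its benchmark dependencies installed, and the base env can be the system Python:

python3 -m venv /path/to/envs/vllm
python3 -m venv /path/to/envs/vsibench
python3 -m venv /path/to/envs/BFCL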
A simple starting point is:
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
Then edit env.example.sh to match the new cluster.
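The bundled file is the source of truth, but it boils down to exports along these lines (values are placeholders; the full variable list appears under the parallel round launcher below):

export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python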
Data / Cache Prep
This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:
- Model weights or HF snapshots
  - For local models, download them to a local path and pass that path as --tokenizer/--model inputs.
  - For the parallel round launcher, set:
    SNAPSHOT_R5=/path/to/round5_snapshot
    SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
- Hugging Face cache for multimodal benchmarks
  - MMBench, VideoMME, and VSI-Bench expect a shared cache root.
  - Create a cache directory such as /path/to/hf_cache (a sketch follows this list), then export:
    export FIXED_HF_CACHE=/path/to/hf_cache
  - Typical subpaths used by the runners are:
    $FIXED_HF_CACHE
    $FIXED_HF_CACHE/hub
    $FIXED_HF_CACHE/datasets
    $FIXED_HF_CACHE/LMUData (for MMBench)
- Docker for DBBench
  - DBBench uses docker compose via AgentBench/extra/docker-compose.yml.
  - Make sure Docker and Compose are available on the cluster node (see the check after this list).
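To pre-create the cache layout from the list above (a minimal sketch; the runners also create these directories on first download):

export FIXED_HF_CACHE=/path/to/hf_cache
mkdir -p "$FIXED_HF_CACHE/hub" "$FIXED_HF_CACHE/datasets" "$FIXED_HF_CACHE/LMUData"

For DBBench, a quick availability check before the first run (assuming the Compose v2 plugin; run from the bundle root):

docker --version
docker compose version
docker compose -f AgentBench/extra/docker-compose.yml config --quiet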
Core Entry Points
1. All 10 benchmarks against one API
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_all_10bench_db_api.sh \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--tokenizer /path/to/local/model_or_tokenizer \
--full \
--canonical-hf-home "$FIXED_HF_CACHE" \
--canonical-vlmeval-cache "$FIXED_HF_CACHE" \
--vsibench-python "$VSIBENCH_PYTHON" \
--vlmeval-python "$VLMEVAL_PYTHON" \
--bfcl-python "$BFCL_PYTHON" \
--base-python "$BASE_PYTHON" \
--aptbench-python "$APTBENCH_PYTHON" \
--dbbench-python "$DBBENCH_PYTHON" \
--output-root ./manual_runs/your_model_run/benchmarks \
--tag your_model_run
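The runners only need an OpenAI-compatible endpoint at --api-base; how you serve the model is up to you. As one option, a vLLM server started from the vllm env can provide it. A hedged sketch (model path, served name, and port are placeholders):

"$VLLM_PYTHON" -m vllm.entrypoints.openai.api_server \
  --model /path/to/local/model_or_tokenizer \
  --served-model-name your_served_model_name \
  --port 8100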
2. Single text benchmark
For arc, ruler, hh_rlhf, or advbench:
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_eval_task_api.sh arc \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--tokenizer /path/to/local/model_or_tokenizer \
--lm-eval-dir ../lm-evaluation-harness \
--include-path ./tasks \
--full
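To sweep all four text benchmarks against the same endpoint, a plain loop over the task names works (same flags as above):

for task in arc ruler hh_rlhf advbench; do
  bash ./run_eval_task_api.sh "$task" \
    --api-base http://127.0.0.1:8100/v1 \
    --model your_served_model_name \
    --tokenizer /path/to/local/model_or_tokenizer \
    --lm-eval-dir ../lm-evaluation-harness \
    --include-path ./tasks \
    --full
done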
3. Individual multimodal benchmarks
- MMBench: agent_eval_api/run_mmbench_api.sh
- VideoMME: agent_eval_api/run_videomme_api.sh
- VSI-Bench: agent_eval_api/run_vsibench_api.sh
- APTBench: agent_eval_api/run_aptbench_api.sh
Example for VideoMME infer mode:
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_videomme_api.sh \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--model-alias your_served_model_name \
--run-mode infer \
--api-nproc 4 \
--hf-home "$FIXED_HF_CACHE" \
--hf-hub-cache "$FIXED_HF_CACHE/hub" \
--hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
--output-root ./manual_runs/videomme_your_model
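The other wrappers follow the same pattern. For instance, an MMBench run might look like the sketch below; this assumes run_mmbench_api.sh accepts the same flags as the VideoMME wrapper, so check the script header before relying on it:

bash ./run_mmbench_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/mmbench_your_model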
4. Parallel round launcher
Before running this helper, you must set:
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
Then run:
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
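A small pre-flight check catches missing exports before the launcher starts any work (a sketch using bash indirect expansion):

for v in SNAPSHOT_R5 SNAPSHOT_R102025 FIXED_HF_CACHE \
         VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON VLLM_PYTHON; do
  if [ -z "${!v:-}" ]; then
    echo "missing required variable: $v" >&2
  fi
done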
Benchmark Notes
- ARC, RULER, HH-RLHF, AdvBench: use lm-evaluation-harness plus local task configs under agent_eval_api/tasks/
- BFCL: uses the bundled BFCL code under agent_eval_api/gorilla/berkeley-function-call-leaderboard/
- MMBench, VideoMME: use VLMEvalKit; these wrappers are often used in infer mode for leaderboard submission workflows
- VSI-Bench: uses the bundled thinking-in-space integration and lmms_eval in the chosen Python env
- APTBench: uses agent_eval_api/APTBench/code/
- DBBench: uses the bundled AgentBench/ code and Docker services
Sanity Checks on a New Cluster
Before running a full evaluation, check:
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh
curl -fsS http://127.0.0.1:8100/v1/models
If the API responds and these scripts parse, the bundle layout is consistent.
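Optionally, confirm the multimodal environments resolve their key packages; a sketch assuming lmms_eval and the VLMEvalKit package (vlmeval) are the correct import names for your installed versions:

"$VSIBENCH_PYTHON" -c "import lmms_eval" && echo "vsibench env OK"
"$VLMEVAL_PYTHON" -c "import vlmeval" && echo "vlmeval env OK"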
Scope
This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:
- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths
Portability Notes
HH-RLHF and AdvBench data are not bundled. Set HH_RLHF_DATASET_PATH and ADVBENCH_DATASET_PATH on the new cluster, or place those files under datasets/hh_rlhf and datasets/advbench inside the bundle root.
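For example (paths are placeholders):

export HH_RLHF_DATASET_PATH=/path/to/hh_rlhf_data
export ADVBENCH_DATASET_PATH=/path/to/advbench_data

Or, to keep the data inside the bundle instead:

mkdir -p datasets/hh_rlhf datasets/advbench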
The core scripts default to relative lm-evaluation-harness and VLMEvalKit cache paths inside this bundle.
The multimodal runners expect a shared cache root, typically exported as FIXED_HF_CACHE.
VideoMME and MMBench are commonly used in infer mode when you plan to upload predictions to an external leaderboard.
APTBench, BFCL, and VSI-Bench benchmark payloads are also not bundled in this package. Populate their expected data locations after you move the bundle to the new cluster.