| # Portable 10-Benchmark Eval Bundle |
|
|
| This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs. |
|
|
| ## Included |
|
|
| - `agent_eval_api/` with the top-level 10-benchmark runners and bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, `BFCL`, and local task configs |
| - `AgentBench/` code/config needed for `DBBench` |
| - `lm-evaluation-harness/` for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench` |
| - `env.example.sh` with cluster-specific path placeholders |
|
|
| ## Not Included |
|
|
| - model checkpoints or HF snapshots |
| - `hf_cache/`, `LMUData/`, downloaded benchmark data |
| - local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, or BFCL `result/` |
| - `AgentBench/data/` and other benchmark payload data |
|
|
| ## Directory Layout |
|
|
| ``` |
| portable_10bench_eval_bundle_20260330_0140/ |
| README.md |
| env.example.sh |
| agent_eval_api/ |
| AgentBench/ |
| lm-evaluation-harness/ |
| ``` |
|
|
| The patched scripts in this bundle use relative paths where possible. |
|
|
| ## Environment Setup |
|
|
| You still need working Python environments on the target cluster. A practical split is: |
|
|
| - `vllm` env: model serving |
| - `vsibench` env: `VSI-Bench` and current `VLMEvalKit` wrappers |
| - `BFCL` env: BFCL |
| - base Python env: text-benchmark wrappers and orchestration |
|
|
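| A simple starting point is to edit `env.example.sh` to match the new cluster, then source it from the bundle root: |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140 |
| source env.example.sh |
| ``` |
|
|
| Source the file again after any later edit so the current shell picks up the new values. |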
|
|
| ## Data / Cache Prep |
|
|
| This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster: |
|
|
| 1. Model weights or HF snapshots |
| - For local models, download them to a local path and pass that path as `--tokenizer` / `--model` inputs. |
| - For the parallel round launcher, set: |
| - `SNAPSHOT_R5=/path/to/round5_snapshot` |
| - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot` |
|
|
| 2. Hugging Face cache for multimodal benchmarks |
| - `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root. |
| - Create a cache directory such as `/path/to/hf_cache`, then export: |
|
|
| ```bash |
| export FIXED_HF_CACHE=/path/to/hf_cache |
| ``` |
|
|
| Typical paths used by the runners are the cache root itself and these subdirectories: |
| - `$FIXED_HF_CACHE` |
| - `$FIXED_HF_CACHE/hub` |
| - `$FIXED_HF_CACHE/datasets` |
| - `$FIXED_HF_CACHE/LMUData` for `MMBench` |
|
|
| 3. Docker for DBBench |
| - `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`. |
| - Make sure Docker and Compose are available on the cluster node. |
|
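The cache layout from step 2 and the Docker requirement from step 3 can be prepared and checked in one pass. The following is a minimal sketch and is not shipped in the bundle: the function names `prepare_cache_dirs` and `check_docker_compose` are hypothetical, and the subdirectory names are the ones listed under step 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the bundle): create the cache
# subdirectories named in step 2 under a given cache root.
prepare_cache_dirs() {
    local root="$1"
    if [ -z "$root" ]; then
        echo "usage: prepare_cache_dirs /path/to/hf_cache" >&2
        return 1
    fi
    mkdir -p "$root/hub" "$root/datasets" "$root/LMUData"
}

# Hypothetical helper: report whether Docker and the Compose plugin
# respond on this node. DBBench needs both.
check_docker_compose() {
    if docker compose version >/dev/null 2>&1; then
        echo "docker compose: available"
    else
        echo "docker compose: missing (DBBench will not run)" >&2
        return 1
    fi
}
```

A typical call would be `prepare_cache_dirs "$FIXED_HF_CACHE" && check_docker_compose`. Creating the directories does not download any data; it only makes sure the paths the runners expect exist.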
|
| ## Core Entry Points |
|
|
| ### 1. All 10 benchmarks against one API |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api |
| |
| bash ./run_all_10bench_db_api.sh \ |
| --api-base http://127.0.0.1:8100/v1 \ |
| --model your_served_model_name \ |
| --tokenizer /path/to/local/model_or_tokenizer \ |
| --full \ |
| --canonical-hf-home "$FIXED_HF_CACHE" \ |
| --canonical-vlmeval-cache "$FIXED_HF_CACHE" \ |
| --vsibench-python "$VSIBENCH_PYTHON" \ |
| --vlmeval-python "$VLMEVAL_PYTHON" \ |
| --bfcl-python "$BFCL_PYTHON" \ |
| --base-python "$BASE_PYTHON" \ |
| --aptbench-python "$APTBENCH_PYTHON" \ |
| --dbbench-python "$DBBENCH_PYTHON" \ |
| --output-root ./manual_runs/your_model_run/benchmarks \ |
| --tag your_model_run |
| ``` |
|
|
| ### 2. Single text benchmark |
|
|
| For `arc`, `ruler`, `hh_rlhf`, or `advbench`: |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api |
| |
| bash ./run_eval_task_api.sh arc \ |
| --api-base http://127.0.0.1:8100/v1 \ |
| --model your_served_model_name \ |
| --tokenizer /path/to/local/model_or_tokenizer \ |
| --lm-eval-dir ../lm-evaluation-harness \ |
| --include-path ./tasks \ |
| --full |
| ``` |
|
|
| ### 3. Individual multimodal benchmarks |
|
|
| - `MMBench`: `agent_eval_api/run_mmbench_api.sh` |
| - `VideoMME`: `agent_eval_api/run_videomme_api.sh` |
| - `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh` |
| - `APTBench`: `agent_eval_api/run_aptbench_api.sh` |
|
|
| Example for `VideoMME` infer mode: |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api |
| |
| bash ./run_videomme_api.sh \ |
| --api-base http://127.0.0.1:8100/v1 \ |
| --model your_served_model_name \ |
| --model-alias your_served_model_name \ |
| --run-mode infer \ |
| --api-nproc 4 \ |
| --hf-home "$FIXED_HF_CACHE" \ |
| --hf-hub-cache "$FIXED_HF_CACHE/hub" \ |
| --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \ |
| --output-root ./manual_runs/videomme_your_model |
| ``` |
|
|
| ### 4. Parallel round launcher |
|
|
| Before running this helper, you must set: |
|
|
| ```bash |
| export SNAPSHOT_R5=/path/to/round5_snapshot |
| export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot |
| export FIXED_HF_CACHE=/path/to/hf_cache |
| export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python |
| export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python |
| export BFCL_PYTHON=/path/to/envs/BFCL/bin/python |
| export VLLM_PYTHON=/path/to/envs/vllm/bin/python |
| ``` |
|
|
| Then run: |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api |
| bash ./run_rounds_from_bfcl_parallel.sh |
| ``` |
|
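Because the launcher depends on several exported variables, a small guard can catch a missing one before any job starts. This is a sketch under the assumption that an empty value is as unusable as an unset one; `require_vars` is a hypothetical name and not a script in the bundle.

```bash
#!/usr/bin/env bash
# Hypothetical guard: fail early if any named variable is unset or empty.
require_vars() {
    local missing=0 name
    for name in "$@"; do
        # ${!name} is bash indirect expansion: it yields the value of the
        # variable whose name is stored in $name.
        if [ -z "${!name:-}" ]; then
            echo "missing required variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_vars SNAPSHOT_R5 SNAPSHOT_R102025 FIXED_HF_CACHE VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON VLLM_PYTHON && bash ./run_rounds_from_bfcl_parallel.sh` runs the launcher only when every variable listed above has a value. The guard reports all missing names in one pass rather than stopping at the first.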
|
| ## Benchmark Notes |
|
|
| - `ARC`, `RULER`, `HH-RLHF`, `AdvBench`: use `lm-evaluation-harness` plus local task configs under `agent_eval_api/tasks/` |
| - `BFCL`: uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/` |
| - `MMBench`, `VideoMME`: use `VLMEvalKit`; these wrappers are often used in `infer` mode for leaderboard submission workflows |
| - `VSI-Bench`: uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env |
| - `APTBench`: uses `agent_eval_api/APTBench/code/` |
| - `DBBench`: uses the bundled `AgentBench/` code and Docker services |
|
|
| ## Sanity Checks on a New Cluster |
|
|
| Before running a full evaluation, check: |
|
|
| ```bash |
| cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api |
| |
| bash -n run_all_10bench_db_api.sh |
| bash -n run_eval_task_api.sh |
| bash -n run_mmbench_api.sh |
| bash -n run_videomme_api.sh |
| bash -n run_vsibench_api.sh |
| bash -n run_aptbench_api.sh |
| |
| curl -fsS http://127.0.0.1:8100/v1/models |
| ``` |
|
|
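| If every script parses and the API responds, the entry points are syntactically intact and the served endpoint is reachable. These checks do not verify Python environments, dataset paths, or caches. |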
|
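The per-script `bash -n` calls can also be wrapped in a loop that reports every missing or unparsable script at once. The following is a minimal sketch; `check_scripts` is a hypothetical helper, not part of the bundle.

```bash
#!/usr/bin/env bash
# Hypothetical helper: syntax-check each listed script with `bash -n` and
# return non-zero if any script is missing or fails to parse.
check_scripts() {
    local dir="$1"; shift
    local failed=0 script
    for script in "$@"; do
        if [ ! -f "$dir/$script" ]; then
            echo "MISSING  $script" >&2
            failed=1
        elif bash -n "$dir/$script" 2>/dev/null; then
            echo "OK       $script"
        else
            echo "SYNTAX   $script" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

From inside `agent_eval_api/`, a call such as `check_scripts . run_all_10bench_db_api.sh run_eval_task_api.sh run_mmbench_api.sh run_videomme_api.sh run_vsibench_api.sh run_aptbench_api.sh` covers the same scripts as the commands above. Like `bash -n` itself, it only checks syntax and does not execute anything.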
|
| ## Scope |
|
|
| This package is a reusable evaluation code bundle, not a frozen environment export. You still need to: |
|
|
| - install the Python environments on the target cluster |
| - download the models and benchmark data/cache there |
| - point the scripts at the new local paths |
|
|
| ## Portability Notes |
|
|
| - `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root. |
| - `APTBench`, `BFCL`, and `VSI-Bench` benchmark payloads are also not bundled in this package. Populate their expected data locations after you move the bundle to the new cluster. |
| - The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle. |
| - The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`. |
| - `VideoMME` and `MMBench` are commonly used in `infer` mode when you plan to upload predictions to an external leaderboard. |
|
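The two options for the HH-RLHF and AdvBench data (an environment variable or a directory inside the bundle root) can be checked before a run. This sketch assumes the environment variable takes precedence over the in-bundle location, which the note above implies but does not state; `resolve_dataset_path` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the dataset location to use, preferring the
# named environment variable and falling back to a path in the bundle.
# Assumption: the override wins over the in-bundle default.
resolve_dataset_path() {
    local var_name="$1" fallback="$2"
    # ${!var_name:-$fallback}: value of the named variable, or the
    # fallback when that variable is unset or empty.
    local chosen="${!var_name:-$fallback}"
    if [ ! -e "$chosen" ]; then
        echo "dataset not found for $var_name: $chosen" >&2
        return 1
    fi
    printf '%s\n' "$chosen"
}
```

Run from the bundle root, `resolve_dataset_path HH_RLHF_DATASET_PATH datasets/hh_rlhf` and `resolve_dataset_path ADVBENCH_DATASET_PATH datasets/advbench` print the path that would be used, or fail with a message naming the path that is missing.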
|