# Portable 10-Benchmark Eval Bundle
This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.
## Included
- `agent_eval_api/` with the top-level 10-benchmark runners and bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, BFCL, and local task configs
- `AgentBench/` code/config needed for `DBBench`
- `lm-evaluation-harness/` for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
- `env.example.sh` with cluster-specific path placeholders
## Not Included
- model checkpoints or HF snapshots
- `hf_cache/`, `LMUData/`, downloaded benchmark data
- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, or BFCL `result/`
- `AgentBench/data/` and other benchmark payload data
## Directory Layout
```
portable_10bench_eval_bundle_20260330_0140/
README.md
env.example.sh
agent_eval_api/
AgentBench/
lm-evaluation-harness/
```
The patched scripts in this bundle use relative paths where possible.
## Environment Setup
You still need working Python environments on the target cluster. A practical split is:
- `vllm` env: model serving
- `vsibench` env: `VSI-Bench` and current `VLMEvalKit` wrappers
- `BFCL` env: the BFCL function-calling runner
- base Python env: text-benchmark wrappers and orchestration
A simple starting point is:
```bash
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
```
Then edit `env.example.sh` to match the new cluster.
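The variables referenced later in this README give a sense of what the edited file should define. A minimal sketch, with every path a placeholder for your cluster:

```shell
# env.example.sh -- illustrative contents; all paths below are placeholders.
export FIXED_HF_CACHE=/path/to/hf_cache                     # shared HF cache root
export SNAPSHOT_R5=/path/to/round5_snapshot                 # parallel round launcher
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
export BASE_PYTHON=$(command -v python3)                    # orchestration / text wrappers
```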
## Data / Cache Prep
This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:
1. Model weights or HF snapshots
- For local models, download them to a local path and pass that path via the `--tokenizer` and `--model` flags.
- For the parallel round launcher, set:
- `SNAPSHOT_R5=/path/to/round5_snapshot`
- `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`
2. Hugging Face cache for multimodal benchmarks
- `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
- Create a cache directory such as `/path/to/hf_cache`, then export:
```bash
export FIXED_HF_CACHE=/path/to/hf_cache
```
Typical subpaths used by the runners are:
- `$FIXED_HF_CACHE`
- `$FIXED_HF_CACHE/hub`
- `$FIXED_HF_CACHE/datasets`
- `$FIXED_HF_CACHE/LMUData` for `MMBench`
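Pre-creating this layout avoids first-run failures when a runner tries to write into a missing directory. A sketch (the `/tmp` root is illustrative; point `FIXED_HF_CACHE` at real storage):

```shell
# Pre-create the cache layout the multimodal runners expect.
# The /tmp root here is illustrative; use your real cache path.
export FIXED_HF_CACHE=/tmp/hf_cache
mkdir -p "$FIXED_HF_CACHE/hub" \
         "$FIXED_HF_CACHE/datasets" \
         "$FIXED_HF_CACHE/LMUData"
ls "$FIXED_HF_CACHE"
```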
3. Docker for DBBench
- `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
- Make sure Docker and Compose are available on the cluster node.
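A quick pre-flight check for this dependency might look like the sketch below; it only reports availability and does not start the `DBBench` services:

```shell
# Check for Docker Engine and the Compose v2 plugin before a DBBench run.
if command -v docker >/dev/null 2>&1 && docker compose version >/dev/null 2>&1; then
  DOCKER_STATUS="available"
else
  DOCKER_STATUS="missing (install Docker Engine + the Compose plugin)"
fi
echo "docker compose: $DOCKER_STATUS"
```

If the check passes, the actual services are started by the wrappers via `docker compose` against `AgentBench/extra/docker-compose.yml`.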
## Core Entry Points
### 1. All 10 benchmarks against one API
```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_all_10bench_db_api.sh \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--tokenizer /path/to/local/model_or_tokenizer \
--full \
--canonical-hf-home "$FIXED_HF_CACHE" \
--canonical-vlmeval-cache "$FIXED_HF_CACHE" \
--vsibench-python "$VSIBENCH_PYTHON" \
--vlmeval-python "$VLMEVAL_PYTHON" \
--bfcl-python "$BFCL_PYTHON" \
--base-python "$BASE_PYTHON" \
--aptbench-python "$APTBENCH_PYTHON" \
--dbbench-python "$DBBENCH_PYTHON" \
--output-root ./manual_runs/your_model_run/benchmarks \
--tag your_model_run
```
### 2. Single text benchmark
For `arc`, `ruler`, `hh_rlhf`, or `advbench`:
```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_eval_task_api.sh arc \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--tokenizer /path/to/local/model_or_tokenizer \
--lm-eval-dir ../lm-evaluation-harness \
--include-path ./tasks \
--full
```
### 3. Individual multimodal benchmarks
- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
- `APTBench`: `agent_eval_api/run_aptbench_api.sh`
Example for `VideoMME` infer mode:
```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_videomme_api.sh \
--api-base http://127.0.0.1:8100/v1 \
--model your_served_model_name \
--model-alias your_served_model_name \
--run-mode infer \
--api-nproc 4 \
--hf-home "$FIXED_HF_CACHE" \
--hf-hub-cache "$FIXED_HF_CACHE/hub" \
--hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
--output-root ./manual_runs/videomme_your_model
```
### 4. Parallel round launcher
Before running this helper, you must set:
```bash
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```
Then run:
```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
```
## Benchmark Notes
- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`
- use `lm-evaluation-harness` plus local task configs under `agent_eval_api/tasks/`
- `BFCL`
- uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
- `MMBench`, `VideoMME`
- use `VLMEvalKit`; these wrappers are often used in `infer` mode for leaderboard submission workflows
- `VSI-Bench`
- uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
- `APTBench`
- uses `agent_eval_api/APTBench/code/`
- `DBBench`
- uses the bundled `AgentBench/` code and Docker services
## Sanity Checks on a New Cluster
Before running a full evaluation, check:
```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh
curl -fsS http://127.0.0.1:8100/v1/models
```
If the API responds and these scripts parse, the bundle layout is consistent.
## Scope
This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:
- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths
## Portability Notes
- `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
- The core scripts default to relative `lm-evaluation-harness` and `VLMEvalKit` cache paths inside this bundle.
- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
- `VideoMME` and `MMBench` are commonly used in `infer` mode when you plan to upload predictions to an external leaderboard.
- `APTBench`, BFCL, and `VSI-Bench` benchmark payloads are also not bundled in this package. Populate their expected data locations after you move the bundle to the new cluster.
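The two ways of supplying the HH-RLHF and AdvBench data can be sketched as follows (paths are placeholders):

```shell
# Option 1: point the wrappers at explicit dataset locations.
export HH_RLHF_DATASET_PATH=/path/to/hh_rlhf
export ADVBENCH_DATASET_PATH=/path/to/advbench

# Option 2: rely on the in-bundle defaults instead, by placing the files
# under the bundle root (relative to the bundle directory):
#   datasets/hh_rlhf/
#   datasets/advbench/
```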