# Portable 10-Benchmark Eval Bundle

This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

## Included

- `agent_eval_api/` with the top-level 10-benchmark runners and bundled code for `APTBench`, `VLMEvalKit`, `thinking-in-space`, BFCL, and local task configs
- `AgentBench/` code/config needed for `DBBench`
- `lm-evaluation-harness/` for `ARC`, `RULER`, `HH-RLHF`, and `AdvBench`
- `env.example.sh` with cluster-specific path placeholders

## Not Included

- model checkpoints or HF snapshots
- `hf_cache/`, `LMUData/`, downloaded benchmark data
- local result directories such as `manual_runs/`, `runs/`, `automation*/`, `score/`, or BFCL `result/`
- `AgentBench/data/` and other benchmark payload data

## Directory Layout

```
portable_10bench_eval_bundle_20260330_0140/
  README.md
  env.example.sh
  agent_eval_api/
  AgentBench/
  lm-evaluation-harness/
```

The patched scripts in this bundle use relative paths where possible.

## Environment Setup

You still need working Python environments on the target cluster. A practical split is:

- `vllm` env: model serving
- `vsibench` env: `VSI-Bench` and current `VLMEvalKit` wrappers
- `BFCL` env: BFCL
- base Python env: text-benchmark wrappers and orchestration

A simple starting point is:

```bash
cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh
```

Then edit `env.example.sh` to match the new cluster.
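A minimal sketch of what the edited `env.example.sh` might look like, using only the variables referenced elsewhere in this README (all paths are placeholders you must replace):

```shell
# Shared Hugging Face cache root for the multimodal runners.
export FIXED_HF_CACHE=/path/to/hf_cache

# Per-environment Python interpreters passed to the runners.
export BASE_PYTHON=/path/to/envs/base/bin/python
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export APTBENCH_PYTHON=/path/to/envs/aptbench/bin/python
export DBBENCH_PYTHON=/path/to/envs/dbbench/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```

The actual file shipped in the bundle may define additional variables; treat this as the minimum set the commands in this README consume.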

## Data / Cache Prep

This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

1. Model weights or HF snapshots
- For local models, download them to a local path and pass that path as `--tokenizer` / `--model` inputs.
- For the parallel round launcher, set:
  - `SNAPSHOT_R5=/path/to/round5_snapshot`
  - `SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot`

2. Hugging Face cache for multimodal benchmarks
- `MMBench`, `VideoMME`, and `VSI-Bench` expect a shared cache root.
- Create a cache directory such as `/path/to/hf_cache`, then export:

```bash
export FIXED_HF_CACHE=/path/to/hf_cache
```

Typical subpaths used by the runners are:
- `$FIXED_HF_CACHE`
- `$FIXED_HF_CACHE/hub`
- `$FIXED_HF_CACHE/datasets`
- `$FIXED_HF_CACHE/LMUData` for `MMBench`
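The subpaths above can be created up front so the first run does not fail on a missing directory (a sketch; some runners may also create these themselves). The fallback to a local `./hf_cache` is only for illustration:

```shell
# Fall back to a local ./hf_cache if FIXED_HF_CACHE is not already set.
export FIXED_HF_CACHE="${FIXED_HF_CACHE:-./hf_cache}"

# Create the subpaths the runners expect; mkdir -p is a no-op if they exist.
mkdir -p "$FIXED_HF_CACHE/hub" \
         "$FIXED_HF_CACHE/datasets" \
         "$FIXED_HF_CACHE/LMUData"
```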

3. Docker for DBBench
- `DBBench` uses `docker compose` via `AgentBench/extra/docker-compose.yml`.
- Make sure Docker and Compose are available on the cluster node.
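A quick way to check the Docker side before a `DBBench` run (a sketch; it validates the bundled compose file without starting any services):

```shell
# Confirm Docker and the Compose plugin are installed.
docker --version
docker compose version

# Parse the bundled compose file; a non-zero exit means it is unusable as-is.
docker compose -f AgentBench/extra/docker-compose.yml config >/dev/null
```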

## Core Entry Points

### 1. All 10 benchmarks against one API

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_all_10bench_db_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --full \
  --canonical-hf-home "$FIXED_HF_CACHE" \
  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
  --vsibench-python "$VSIBENCH_PYTHON" \
  --vlmeval-python "$VLMEVAL_PYTHON" \
  --bfcl-python "$BFCL_PYTHON" \
  --base-python "$BASE_PYTHON" \
  --aptbench-python "$APTBENCH_PYTHON" \
  --dbbench-python "$DBBENCH_PYTHON" \
  --output-root ./manual_runs/your_model_run/benchmarks \
  --tag your_model_run
```

### 2. Single text benchmark

For `arc`, `ruler`, `hh_rlhf`, or `advbench`:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_eval_task_api.sh arc \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --lm-eval-dir ../lm-evaluation-harness \
  --include-path ./tasks \
  --full
```

### 3. Individual multimodal benchmarks

- `MMBench`: `agent_eval_api/run_mmbench_api.sh`
- `VideoMME`: `agent_eval_api/run_videomme_api.sh`
- `VSI-Bench`: `agent_eval_api/run_vsibench_api.sh`
- `APTBench`: `agent_eval_api/run_aptbench_api.sh`

Example for `VideoMME` infer mode:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_videomme_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/videomme_your_model
```

### 4. Parallel round launcher

Before running this helper, you must set:

```bash
export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python
```

Then run:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
```

## Benchmark Notes

- `ARC`, `RULER`, `HH-RLHF`, `AdvBench`
  - use `lm-evaluation-harness` plus local task configs under `agent_eval_api/tasks/`
- `BFCL`
  - uses the bundled BFCL code under `agent_eval_api/gorilla/berkeley-function-call-leaderboard/`
- `MMBench`, `VideoMME`
  - use `VLMEvalKit`; these wrappers are often used in `infer` mode for leaderboard submission workflows
- `VSI-Bench`
  - uses the bundled `thinking-in-space` integration and `lmms_eval` in the chosen Python env
- `APTBench`
  - uses `agent_eval_api/APTBench/code/`
- `DBBench`
  - uses the bundled `AgentBench/` code and Docker services

## Sanity Checks on a New Cluster

Before running a full evaluation, check:

```bash
cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh

curl -fsS http://127.0.0.1:8100/v1/models
```

If the API responds and these scripts parse, the bundle layout is consistent.

## Scope

This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

- install the Python environments on the target cluster
- download the models and benchmark data/cache there
- point the scripts at the new local paths

## Portability Notes

- `HH-RLHF` and `AdvBench` data are not bundled. Set `HH_RLHF_DATASET_PATH` and `ADVBENCH_DATASET_PATH` on the new cluster, or place those files under `datasets/hh_rlhf` and `datasets/advbench` inside the bundle root.
- The core scripts default to relative paths for `lm-evaluation-harness` and the `VLMEvalKit` cache inside this bundle.
- The multimodal runners expect a shared cache root, typically exported as `FIXED_HF_CACHE`.
- `VideoMME` and `MMBench` are commonly run in `infer` mode when you plan to upload predictions to an external leaderboard.

- `APTBench`, BFCL, and `VSI-Bench` benchmark payloads are also not bundled in this package. Populate their expected data locations after moving the bundle to the new cluster.
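The dataset-path overrides mentioned above can be exported before launching the text-benchmark wrappers (paths are placeholders):

```shell
# Point the HH-RLHF and AdvBench wrappers at locally staged data.
export HH_RLHF_DATASET_PATH=/path/to/hh_rlhf
export ADVBENCH_DATASET_PATH=/path/to/advbench
```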