Portable 10-Benchmark Eval Bundle

This bundle contains the code needed to run the 10-benchmark API evaluation stack on another cluster without shipping local datasets, caches, model weights, or run outputs.

Included

  • agent_eval_api/ with the top-level 10-benchmark runners and bundled code for APTBench, VLMEvalKit, thinking-in-space, BFCL, and local task configs
  • AgentBench/ code/config needed for DBBench
  • lm-evaluation-harness/ for ARC, RULER, HH-RLHF, and AdvBench
  • env.example.sh with cluster-specific path placeholders

Not Included

  • model checkpoints or HF snapshots
  • hf_cache/, LMUData/, downloaded benchmark data
  • local result directories such as manual_runs/, runs/, automation*/, score/, or BFCL result/
  • AgentBench/data/ and other benchmark payload data

Directory Layout

portable_10bench_eval_bundle_20260330_0140/
  README.md
  env.example.sh
  agent_eval_api/
  AgentBench/
  lm-evaluation-harness/

The patched scripts in this bundle use relative paths where possible.

Environment Setup

You still need working Python environments on the target cluster. A practical split is:

  • vllm env: model serving
  • vsibench env: VSI-Bench and current VLMEvalKit wrappers
  • BFCL env: BFCL
  • base Python env: text-benchmark wrappers and orchestration

A simple starting point is to edit env.example.sh so that its paths match the new cluster, then source it:

cd portable_10bench_eval_bundle_20260330_0140
source env.example.sh

Data / Cache Prep

This bundle does not include benchmark data. Before running, prepare these inputs on the new cluster:

  1. Model weights or HF snapshots
  • For local models, download them to a local path and pass that path as --tokenizer / --model inputs.
  • For the parallel round launcher, set:
    • SNAPSHOT_R5=/path/to/round5_snapshot
    • SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
  2. Hugging Face cache for multimodal benchmarks
  • MMBench, VideoMME, and VSI-Bench expect a shared cache root.
  • Create a cache directory such as /path/to/hf_cache, then export:
export FIXED_HF_CACHE=/path/to/hf_cache

Typical subpaths used by the runners are:

  • $FIXED_HF_CACHE
  • $FIXED_HF_CACHE/hub
  • $FIXED_HF_CACHE/datasets
  • $FIXED_HF_CACHE/LMUData for MMBench
  3. Docker for DBBench
  • DBBench uses docker compose via AgentBench/extra/docker-compose.yml.
  • Make sure Docker and Compose are available on the cluster node.
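
The Hugging Face cache layout and the Docker requirement listed above can be prepared and checked up front. The following is a minimal sketch, not part of the bundle: the function names prepare_hf_cache and check_docker_compose are hypothetical, and the Docker check assumes the Compose v2 plugin (docker compose) rather than the older standalone docker-compose binary.

```shell
# Create the cache root and the subpaths the runners typically use.
# $1 is the cache root, e.g. the value you export as FIXED_HF_CACHE.
prepare_hf_cache() {
  local root="$1"
  mkdir -p "$root/hub" "$root/datasets" "$root/LMUData"
}

# Report whether Docker and the Compose plugin are callable on this node.
check_docker_compose() {
  if command -v docker >/dev/null 2>&1 && docker compose version >/dev/null 2>&1; then
    echo "docker compose: ok"
  else
    echo "docker compose: missing"
  fi
}
```

For example, `prepare_hf_cache /path/to/hf_cache` followed by `export FIXED_HF_CACHE=/path/to/hf_cache` gives the runners an empty but correctly shaped cache root; the benchmark data still has to be downloaded into it.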

Core Entry Points

1. All 10 benchmarks against one API

cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_all_10bench_db_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --full \
  --canonical-hf-home "$FIXED_HF_CACHE" \
  --canonical-vlmeval-cache "$FIXED_HF_CACHE" \
  --vsibench-python "$VSIBENCH_PYTHON" \
  --vlmeval-python "$VLMEVAL_PYTHON" \
  --bfcl-python "$BFCL_PYTHON" \
  --base-python "$BASE_PYTHON" \
  --aptbench-python "$APTBENCH_PYTHON" \
  --dbbench-python "$DBBENCH_PYTHON" \
  --output-root ./manual_runs/your_model_run/benchmarks \
  --tag your_model_run

2. Single text benchmark

For arc, ruler, hh_rlhf, or advbench:

cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_eval_task_api.sh arc \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --tokenizer /path/to/local/model_or_tokenizer \
  --lm-eval-dir ../lm-evaluation-harness \
  --include-path ./tasks \
  --full
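
Because the same wrapper serves all four text tasks, they can be run in sequence. The sketch below assumes it is invoked from agent_eval_api/; the function run_text_benchmarks and the DRY_RUN variable are hypothetical conveniences, not part of the bundle. Stopping at the first failing task is a design choice made here for simplicity.

```shell
# Run the four text benchmarks one after another through the bundled wrapper.
# Extra arguments (--api-base, --model, --tokenizer, ...) are passed through unchanged.
# Set DRY_RUN=1 to print each command instead of executing it.
run_text_benchmarks() {
  local task
  for task in arc ruler hh_rlhf advbench; do
    ${DRY_RUN:+echo} bash ./run_eval_task_api.sh "$task" "$@" || return 1
  done
}
```

Calling it with DRY_RUN=1 first is a cheap way to confirm the argument list before committing to a full run.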

3. Individual multimodal benchmarks

  • MMBench: agent_eval_api/run_mmbench_api.sh
  • VideoMME: agent_eval_api/run_videomme_api.sh
  • VSI-Bench: agent_eval_api/run_vsibench_api.sh
  • APTBench: agent_eval_api/run_aptbench_api.sh

Example for VideoMME infer mode:

cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash ./run_videomme_api.sh \
  --api-base http://127.0.0.1:8100/v1 \
  --model your_served_model_name \
  --model-alias your_served_model_name \
  --run-mode infer \
  --api-nproc 4 \
  --hf-home "$FIXED_HF_CACHE" \
  --hf-hub-cache "$FIXED_HF_CACHE/hub" \
  --hf-datasets-cache "$FIXED_HF_CACHE/datasets" \
  --output-root ./manual_runs/videomme_your_model

4. Parallel round launcher

Before running this helper, you must set:

export SNAPSHOT_R5=/path/to/round5_snapshot
export SNAPSHOT_R102025=/path/to/round10_15_20_25_snapshot
export FIXED_HF_CACHE=/path/to/hf_cache
export VSIBENCH_PYTHON=/path/to/envs/vsibench/bin/python
export VLMEVAL_PYTHON=/path/to/envs/vlmeval/bin/python
export BFCL_PYTHON=/path/to/envs/BFCL/bin/python
export VLLM_PYTHON=/path/to/envs/vllm/bin/python

Then run:

cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api
bash ./run_rounds_from_bfcl_parallel.sh
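
Since the launcher depends on every variable listed above, a small guard can catch an unset one before anything starts. This is a sketch only; require_vars is a hypothetical helper, not something the bundle provides.

```shell
# Return non-zero and name each variable that is unset or empty.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible usage before the launcher:
#   require_vars SNAPSHOT_R5 SNAPSHOT_R102025 FIXED_HF_CACHE \
#     VSIBENCH_PYTHON VLMEVAL_PYTHON BFCL_PYTHON VLLM_PYTHON \
#     && bash ./run_rounds_from_bfcl_parallel.sh
```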

Benchmark Notes

  • ARC, RULER, HH-RLHF, AdvBench
    • use lm-evaluation-harness plus local task configs under agent_eval_api/tasks/
  • BFCL
    • uses the bundled BFCL code under agent_eval_api/gorilla/berkeley-function-call-leaderboard/
  • MMBench, VideoMME
    • use VLMEvalKit; these wrappers are often used in infer mode for leaderboard submission workflows
  • VSI-Bench
    • uses the bundled thinking-in-space integration and lmms_eval in the chosen Python env
  • APTBench
    • uses agent_eval_api/APTBench/code/
  • DBBench
    • uses the bundled AgentBench/ code and Docker services

Sanity Checks on a New Cluster

Before running a full evaluation, check:

cd portable_10bench_eval_bundle_20260330_0140/agent_eval_api

bash -n run_all_10bench_db_api.sh
bash -n run_eval_task_api.sh
bash -n run_mmbench_api.sh
bash -n run_videomme_api.sh
bash -n run_vsibench_api.sh
bash -n run_aptbench_api.sh

curl -fsS http://127.0.0.1:8100/v1/models

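If the API responds and these scripts parse, the entry-point scripts are present and syntactically valid and the endpoint is reachable. These checks do not verify the Python environments, the benchmark data, or the Docker setup.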
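
The per-script syntax checks above can be wrapped in a loop that reports every failure instead of stopping at the first one. A minimal sketch; syntax_check is a hypothetical helper, and a missing script is reported as a failure just like one that does not parse.

```shell
# Run `bash -n` on each named script inside a directory.
# $1 is the directory; remaining arguments are script file names.
# Prints one status line per script and returns non-zero if any check failed.
syntax_check() {
  local dir="$1" failed=0 script
  shift
  for script in "$@"; do
    if bash -n "$dir/$script" 2>/dev/null; then
      echo "ok   $script"
    else
      echo "FAIL $script"
      failed=1
    fi
  done
  return "$failed"
}

# Possible usage from the bundle root:
#   syntax_check agent_eval_api run_all_10bench_db_api.sh run_eval_task_api.sh \
#     run_mmbench_api.sh run_videomme_api.sh run_vsibench_api.sh run_aptbench_api.sh
```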

Scope

This package is a reusable evaluation code bundle, not a frozen environment export. You still need to:

  • install the Python environments on the target cluster
  • download the models and benchmark data/cache there
  • point the scripts at the new local paths

Portability Notes

  • HH-RLHF and AdvBench data are not bundled. Set HH_RLHF_DATASET_PATH and ADVBENCH_DATASET_PATH on the new cluster, or place those files under datasets/hh_rlhf and datasets/advbench inside the bundle root.

  • The core scripts default to relative lm-evaluation-harness and VLMEvalKit cache paths inside this bundle.

  • The multimodal runners expect a shared cache root, typically exported as FIXED_HF_CACHE.

  • VideoMME and MMBench are commonly used in infer mode when you plan to upload predictions to an external leaderboard.

  • APTBench, BFCL, and VSI-Bench benchmark payloads are also not bundled in this package. Populate their expected data locations after you move the bundle to the new cluster.
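
The two options given for HH-RLHF and AdvBench (an environment variable, or files under datasets/ inside the bundle root) suggest an override-then-default lookup. The sketch below illustrates that order under the assumption that the environment variable takes precedence; it is not the bundle's actual code, and resolve_dataset_path is a hypothetical name.

```shell
# Print the dataset location to use.
# $1: value of the override variable (may be empty)
# $2: bundle root directory
# $3: default subdirectory name under <bundle root>/datasets
resolve_dataset_path() {
  local override="$1" bundle_root="$2" default_subdir="$3"
  if [ -n "$override" ]; then
    printf '%s\n' "$override"
  else
    printf '%s\n' "$bundle_root/datasets/$default_subdir"
  fi
}

# Possible usage:
#   resolve_dataset_path "${HH_RLHF_DATASET_PATH:-}" "$PWD" hh_rlhf
#   resolve_dataset_path "${ADVBENCH_DATASET_PATH:-}" "$PWD" advbench
```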
