SLM Agentic Benchmark Suite

A uv workspace package to evaluate local HuggingFace models against agentic and academic benchmarks.

Docs: USAGE.md (commands and workflows) · docs/benchmarks.md (per-benchmark reference) · ../USAGE.md (full research tree)

Suite     CLI            What it measures
Agentic   slm-benchmark  BFCL, τ-bench, GAIA, SWE-bench
Academic  slm-lm-eval    ARC, HellaSwag, GSM8K, … (lm-evaluation-harness)

Install

From the repo root:

uv sync --group evals
uv sync --group lm-eval   # optional: slm-lm-eval academic benchmarks

Quickstart

# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# YAML config
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml

Project structure

research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/              # created at runtime (relative to cwd)

CLI reference

--model          Path to local HF model dir (or Hub ID)
--benchmarks     Space-separated: bfcl tau_bench gaia swe_bench all
--config         YAML config file (overrides CLI flags)
--max-samples    Cap samples per benchmark
--output-dir     Results directory (default: ./results)
--experiment-name  Tag for this run
--device         auto | cpu | cuda | cuda:0
--dtype          float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens Max tokens per generation (default: 512)
--temperature    Sampling temp (default: 0.0 = greedy)
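
The precedence rule above (a YAML config overrides CLI flags) can be sketched as follows. This is a minimal illustration, not the package's actual loader: the `resolve_settings` helper, the reduced flag set, and the underscore-style config keys (`max_samples`, `temperature`) are assumptions made for the example, and the config is shown as an already-parsed dict rather than a file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced stand-in for the slm-benchmark CLI (only a few of the documented flags)."""
    parser = argparse.ArgumentParser(prog="slm-benchmark")
    parser.add_argument("--model")
    parser.add_argument("--benchmarks", nargs="+", default=[])
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--output-dir", default="./results")
    parser.add_argument("--max-new-tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.0)
    return parser


def resolve_settings(cli_args: argparse.Namespace, config: dict) -> dict:
    """Hypothetical merge step: start from CLI values, then let config values win."""
    settings = vars(cli_args).copy()
    # Keys present in the config replace the CLI value; None entries are ignored.
    settings.update({key: value for key, value in config.items() if value is not None})
    return settings


args = build_parser().parse_args(
    ["--model", "./my-model", "--benchmarks", "bfcl", "--max-samples", "20"]
)
# Assumed config contents (key names are illustrative, not the real schema).
config = {"max_samples": 5, "temperature": 0.2}

settings = resolve_settings(args, config)
print(settings["max_samples"], settings["temperature"], settings["max_new_tokens"])
# prints: 5 0.2 512
```

Here `--max-samples 20` from the command line is replaced by the config value, while flags the config does not mention keep their CLI or default values.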

Adding a custom benchmark

  1. Create src/slm_evals/benchmarks/my_bench.py and subclass BaseBenchmark.
  2. Register it in src/slm_evals/run_benchmark.py → BENCHMARK_REGISTRY.
  3. Run: uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench
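
The steps above might look roughly like the sketch below. The real `BaseBenchmark` interface is not shown in this README, so the block defines its own stand-in base class; the method names (`load_samples`, `score`, `run`), the result keys, and the `fake_generate` callable are all assumptions for illustration. Check `benchmarks/base.py` for the actual abstract methods before copying this.

```python
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    """Stand-in for slm_evals.benchmarks.base.BaseBenchmark (real interface may differ)."""

    name = "base"

    def __init__(self, max_samples=None):
        self.max_samples = max_samples

    @abstractmethod
    def load_samples(self):
        """Return a list of sample dicts."""

    @abstractmethod
    def score(self, sample, output):
        """Return a score in [0, 1] for one model output."""

    def run(self, generate):
        # Apply the per-benchmark sample cap, then score every generation.
        samples = self.load_samples()
        if self.max_samples is not None:
            samples = samples[: self.max_samples]
        scores = [self.score(s, generate(s["prompt"])) for s in samples]
        accuracy = sum(scores) / len(scores) if scores else 0.0
        return {"benchmark": self.name, "n": len(scores), "accuracy": accuracy}


class MyBench(BaseBenchmark):
    """Step 1: a toy benchmark with two hard-coded samples."""

    name = "my_bench"

    def load_samples(self):
        return [
            {"prompt": "2 + 2 =", "answer": "4"},
            {"prompt": "Capital of France?", "answer": "Paris"},
        ]

    def score(self, sample, output):
        # Case-insensitive substring match against the reference answer.
        return 1.0 if sample["answer"].lower() in output.lower() else 0.0


# Step 2: registry mapping the --benchmarks name to the class (assumed shape).
BENCHMARK_REGISTRY = {"my_bench": MyBench}


def fake_generate(prompt):
    """Placeholder for a model call; answers only the arithmetic prompt."""
    return "4" if "2 + 2" in prompt else "I am not sure"


result = BENCHMARK_REGISTRY["my_bench"](max_samples=2).run(fake_generate)
print(result)
# prints: {'benchmark': 'my_bench', 'n': 2, 'accuracy': 0.5}
```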

Output formats

Results are written under <output-dir>/<experiment_name>/:

  • results.json: full structured dump
  • results.csv: one row per benchmark
  • report.md: human-readable summary
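
For post-processing, the per-benchmark CSV can be read with the standard library alone. The sketch below assumes only the directory layout described above; the `load_results_csv` helper is hypothetical, and the column names in the demo file (`benchmark`, `accuracy`) are made up for the example, since the real column set is not documented here.

```python
import csv
import tempfile
from pathlib import Path


def load_results_csv(output_dir, experiment_name):
    """Read <output-dir>/<experiment_name>/results.csv into a list of row dicts."""
    path = Path(output_dir) / experiment_name / "results.csv"
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Demo against a throwaway file with assumed column names.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "experiment_001"
    run_dir.mkdir(parents=True)
    (run_dir / "results.csv").write_text(
        "benchmark,accuracy\nbfcl,0.41\ntau_bench,0.18\n", encoding="utf-8"
    )
    rows = load_results_csv(tmp, "experiment_001")

for row in rows:
    # csv.DictReader yields every value as a string.
    print(row["benchmark"], row["accuracy"])
```

Because `csv.DictReader` keys rows by whatever header the file contains, the helper keeps working if the actual columns differ from the ones assumed in the demo.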

Notes

τ-bench user simulator: The default is a lightweight rule-based simulator. Set use_llm_user: true in the config to use the GPT-4o user agent instead (this incurs API cost).

SWE-bench full eval: Set full_eval: true to run the official Docker harness (requires pip install swebench docker).

GAIA tools: Offline by default (tool_mode: describe). Wire real tools in gaia.py for live eval.