# SLM Agentic Benchmark Suite

A uv workspace package to evaluate **local HuggingFace models** against agentic and academic benchmarks.

**Docs:** [USAGE.md](USAGE.md) (commands and workflows) · [docs/benchmarks.md](docs/benchmarks.md) (per-benchmark reference) · [../USAGE.md](../USAGE.md) (full research tree)

| Suite | CLI | What it measures |
|---|---|---|
| **Agentic** | `slm-benchmark` | BFCL, τ-bench, GAIA, SWE-bench |
| **Academic** | `slm-lm-eval` | ARC, HellaSwag, GSM8K, … (lm-evaluation-harness) |
## Install

From the repo root:

```bash
uv sync --group evals
uv sync --group lm-eval   # optional: slm-lm-eval academic benchmarks
```
## Quickstart

```bash
# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# YAML config
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml
```
## Project structure

```
research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/                  # created at runtime (relative to cwd)
```
## CLI reference

```
--model            Path to local HF model dir (or Hub ID)
--benchmarks       Space-separated: bfcl tau_bench gaia swe_bench all
--config           YAML config file (overrides CLI flags)
--max-samples      Cap samples per benchmark
--output-dir       Results directory (default: ./results)
--experiment-name  Tag for this run
--device           auto | cpu | cuda | cuda:0
--dtype            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens   Max tokens per generation (default: 512)
--temperature      Sampling temp (default: 0.0 = greedy)
```
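
The `--config` entry above states that values from the YAML file override CLI flags. A minimal sketch of that precedence rule, assuming the YAML has already been parsed into a dict whose keys mirror the flag names. The `merge_settings` helper and the key names are hypothetical and are not taken from `config_loader.py`:

```python
from __future__ import annotations

from typing import Any


def merge_settings(cli_args: dict[str, Any], config: dict[str, Any]) -> dict[str, Any]:
    """Return the effective settings, letting config values win over CLI values.

    Hypothetical helper for illustration; the package's real logic may differ.
    """
    merged = dict(cli_args)
    # Keys present in the config (and not None) replace the CLI value;
    # keys absent from the config keep whatever the CLI supplied.
    merged.update({key: value for key, value in config.items() if value is not None})
    return merged


# Example values only: flag names are written with underscores here by assumption.
cli_args = {"model": "./my-model", "max_samples": 20, "temperature": 0.0}
config = {"max_samples": 100}

effective = merge_settings(cli_args, config)
print(effective)
```

In this sketch `max_samples` comes from the config while `model` and `temperature` fall through from the command line.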
## Adding a custom benchmark

1. Create `src/slm_evals/benchmarks/my_bench.py` and subclass `BaseBenchmark`.
2. Register it in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`.
3. Run: `uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench`
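
The steps above can be sketched as follows. This README does not show the `BaseBenchmark` interface, so the stand-in class below and its method names (`load_samples`, `score`) are assumptions, as are the shape of `BENCHMARK_REGISTRY` (name mapped to class) and the `run` helper. Check `benchmarks/base.py` and `run_benchmark.py` for the real signatures before copying anything.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Callable


class BaseBenchmark(ABC):
    """Stand-in for slm_evals.benchmarks.base.BaseBenchmark (interface assumed)."""

    name: str = "base"

    @abstractmethod
    def load_samples(self, max_samples: int | None = None) -> list[dict]:
        """Return the evaluation samples, optionally capped."""

    @abstractmethod
    def score(self, sample: dict, output: str) -> float:
        """Score one model output against one sample (1.0 = correct)."""


class MyBench(BaseBenchmark):
    """Step 1: a toy benchmark with two hard-coded samples."""

    name = "my_bench"

    def load_samples(self, max_samples: int | None = None) -> list[dict]:
        samples = [
            {"prompt": "2 + 2 =", "answer": "4"},
            {"prompt": "Capital of France?", "answer": "Paris"},
        ]
        return samples[:max_samples]  # a None cap keeps every sample

    def score(self, sample: dict, output: str) -> float:
        # Case-insensitive substring match on the reference answer.
        return 1.0 if sample["answer"].lower() in output.lower() else 0.0


# Step 2: register the class under the name passed to --benchmarks (shape assumed).
BENCHMARK_REGISTRY: dict[str, type[BaseBenchmark]] = {"my_bench": MyBench}


def run(name: str, generate: Callable[[str], str], max_samples: int | None = None) -> float:
    """Hypothetical driver: mean score of `generate` over the benchmark's samples."""
    bench = BENCHMARK_REGISTRY[name]()
    samples = bench.load_samples(max_samples)
    scores = [bench.score(sample, generate(sample["prompt"])) for sample in samples]
    return sum(scores) / len(scores) if scores else 0.0


# A fake "model" that always answers "4" gets one of the two samples right.
accuracy = run("my_bench", lambda prompt: "4")
print(accuracy)
```

The fake generator stands in for the loaded HuggingFace model so the sketch runs without downloading weights.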
## Output formats

Results are written under `<output-dir>/<experiment_name>/`:

- `results.json`: full structured dump
- `results.csv`: one row per benchmark
- `report.md`: human-readable summary
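
To pick these files up from a script, the layout above can be resolved with `pathlib`. This is a small sketch: `result_paths` is a hypothetical helper, and `experiment_001` is only an example experiment name. The schema of `results.json` is not documented here, so the sketch stops at locating the files.

```python
from __future__ import annotations

from pathlib import Path

# File names listed in the section above.
RESULT_FILES = ("results.json", "results.csv", "report.md")


def result_paths(output_dir: str, experiment_name: str) -> dict[str, Path]:
    """Map each result file name to its expected location for one run."""
    run_dir = Path(output_dir) / experiment_name
    return {name: run_dir / name for name in RESULT_FILES}


paths = result_paths("./results", "experiment_001")
for name, path in paths.items():
    # exists() reports False until a benchmark run has written the file.
    print(f"{name}: {path} (exists: {path.exists()})")
```

Because the default `./results` directory is relative to the current working directory, run such a script from the same directory the benchmark was launched from.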
## Notes

**τ-bench user simulator**: The default is a lightweight rule-based simulator. Set `use_llm_user: true` in the config to use the GPT-4o user agent instead (this incurs API cost).

**SWE-bench full eval**: Set `full_eval: true` to run the official Docker harness (requires `pip install swebench docker`).

**GAIA tools**: Offline by default (`tool_mode: describe`). Wire real tools into `gaia.py` for live evaluation.