# SLM Agentic Benchmark Suite

A uv workspace package to evaluate **local HuggingFace models** against agentic and academic benchmarks.

**Docs:** [USAGE.md](USAGE.md) (commands and workflows) · [docs/benchmarks.md](docs/benchmarks.md) (per-benchmark reference) · [../USAGE.md](../USAGE.md) (full research tree)

| Suite | CLI | What it measures |
|---|---|---|
| **Agentic** | `slm-benchmark` | BFCL, τ-bench, GAIA, SWE-bench |
| **Academic** | `slm-lm-eval` | ARC, HellaSwag, GSM8K, … (lm-evaluation-harness) |

## Install

From the repo root:

```bash
uv sync --group evals
uv sync --group lm-eval   # optional: slm-lm-eval academic benchmarks
```

## Quickstart

```bash
# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# YAML config
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml
```

## Project structure

```
research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/              # created at runtime (relative to cwd)
```

## CLI reference

```
--model            Path to local HF model dir (or Hub ID)
--benchmarks       Space-separated: bfcl tau_bench gaia swe_bench all
--config           YAML config file (overrides CLI flags)
--max-samples      Cap samples per benchmark
--output-dir       Results directory (default: ./results)
--experiment-name  Tag for this run
--device           auto | cpu | cuda | cuda:0
--dtype            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens   Max tokens per generation (default: 512)
--temperature      Sampling temp (default: 0.0 = greedy)
```

## Adding a custom benchmark

1. Create `src/slm_evals/benchmarks/my_bench.py` and subclass `BaseBenchmark`.
2. Register it in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`.
3. Run: `uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench`

## Output formats

Results are written under `//`:

- `results.json`: full structured dump
- `results.csv`: one row per benchmark
- `report.md`: human-readable summary

## Notes

**τ-bench user simulator**: Default is a lightweight rule-based simulator. Set `use_llm_user: true` in config for the GPT-4o user agent (API cost).

**SWE-bench full eval**: Set `full_eval: true` to run the official Docker harness (`pip install swebench docker`).

**GAIA tools**: Offline by default (`tool_mode: describe`). Wire real tools in `gaia.py` for live eval.
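
**Custom benchmark sketch**: The steps under "Adding a custom benchmark" (subclass, then register) might look roughly like the following. This is a minimal sketch only: the real `BaseBenchmark` interface in `base.py` is not documented in this README, so `BaseBenchmarkStandIn`, `BenchmarkResult`, the `run(generate)` signature, and the substring-match scoring are all hypothetical stand-ins. Check `base.py` and `BENCHMARK_REGISTRY` in `run_benchmark.py` for the actual contract before copying anything.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BenchmarkResult:
    """Hypothetical result record; the real suite may use a different shape."""
    name: str
    num_samples: int
    accuracy: float


class BaseBenchmarkStandIn(ABC):
    """Stand-in for slm_evals' BaseBenchmark (real interface is an assumption)."""

    name: str = "base"

    def __init__(self, max_samples: Optional[int] = None):
        # Assumed to mirror the --max-samples flag: cap samples per benchmark.
        self.max_samples = max_samples

    @abstractmethod
    def run(self, generate: Callable[[str], str]) -> BenchmarkResult:
        """Score a text-generation callable and return a result record."""


class MyBench(BaseBenchmarkStandIn):
    """Toy benchmark: a sample counts as correct if the expected answer
    appears (case-insensitively) in the model output."""

    name = "my_bench"
    SAMPLES = [
        ("2 + 2 =", "4"),
        ("Capital of France?", "Paris"),
        ("3 * 3 =", "9"),
    ]

    def run(self, generate: Callable[[str], str]) -> BenchmarkResult:
        # Slicing with None keeps every sample, so "no cap" needs no special case.
        samples = self.SAMPLES[: self.max_samples]
        correct = sum(
            expected.lower() in generate(prompt).lower()
            for prompt, expected in samples
        )
        accuracy = correct / len(samples) if samples else 0.0
        return BenchmarkResult(self.name, len(samples), accuracy)


# Hypothetical registry shape: benchmark name -> benchmark class.
BENCHMARK_REGISTRY = {MyBench.name: MyBench}


def fake_generate(prompt: str) -> str:
    """Canned replies standing in for a real model's generate call."""
    canned = {"2 + 2 =": "The answer is 4.", "Capital of France?": "paris"}
    return canned.get(prompt, "I don't know")


result = BENCHMARK_REGISTRY["my_bench"](max_samples=3).run(fake_generate)
print(result)
```

Passing the model in as a plain callable keeps the toy benchmark testable without loading any weights; the real suite presumably hands over a loaded model from `model_loader.py` instead.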