# SLM Agentic Benchmark Suite
A uv workspace package to evaluate **local HuggingFace models** against agentic and academic benchmarks.
**Docs:** [USAGE.md](USAGE.md) (commands and workflows) · [docs/benchmarks.md](docs/benchmarks.md) (per-benchmark reference) · [../USAGE.md](../USAGE.md) (full research tree)
| Suite | CLI | What it measures |
|---|---|---|
| **Agentic** | `slm-benchmark` | BFCL, τ-bench, GAIA, SWE-bench |
| **Academic** | `slm-lm-eval` | ARC, HellaSwag, GSM8K, ... (lm-evaluation-harness) |
## Install
From the repo root:
```bash
uv sync --group evals
uv sync --group lm-eval # optional: slm-lm-eval academic benchmarks
```
## Quickstart
```bash
# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# YAML config
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml
```
## Project structure
```
research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/              # created at runtime (relative to cwd)
```
## CLI reference
```
--model            Path to local HF model dir (or Hub ID)
--benchmarks       Space-separated: bfcl tau_bench gaia swe_bench all
--config           YAML config file (overrides CLI flags)
--max-samples      Cap samples per benchmark
--output-dir       Results directory (default: ./results)
--experiment-name  Tag for this run
--device           auto | cpu | cuda | cuda:0
--dtype            float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens   Max tokens per generation (default: 512)
--temperature      Sampling temp (default: 0.0 = greedy)
```
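The precedence stated for `--config` (values from the YAML file override CLI flags) can be illustrated with a small sketch. This is not the package's actual resolution code: the function name `resolve_settings`, the snake_case key names, and the per-key merge are all assumptions made for illustration. Only the two defaults and the "config wins" rule come from the reference above.

```python
from typing import Optional

# Defaults taken from the CLI reference; the snake_case key names are assumed.
DEFAULTS = {"output_dir": "./results", "max_new_tokens": 512, "temperature": 0.0}


def resolve_settings(cli_args: dict, config: Optional[dict]) -> dict:
    """Hypothetical merge: defaults < CLI flags < YAML config."""
    # Drop CLI flags the user did not pass (represented here as None).
    passed = {k: v for k, v in cli_args.items() if v is not None}
    settings = {**DEFAULTS, **passed}
    # Per the reference, values from the config file take precedence.
    if config:
        settings.update(config)
    return settings


settings = resolve_settings(
    {"model": "./my-model", "max_samples": 20, "temperature": None},
    {"max_samples": 50},  # stands in for an already-parsed YAML file
)
print(settings["max_samples"])  # the config value replaces the CLI value
```

Whether the real loader merges key by key or replaces the CLI arguments wholesale is not stated here, so check `utils/config_loader.py` before relying on mixed usage.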
## Adding a custom benchmark
1. Create `src/slm_evals/benchmarks/my_bench.py` and subclass `BaseBenchmark`.
2. Register it in `src/slm_evals/run_benchmark.py` → `BENCHMARK_REGISTRY`.
3. Run: `uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench`
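The steps above can be sketched as follows. The real `BaseBenchmark` interface is not documented in this README, so the block defines a stand-in with assumed method names (`load_samples`, `score`, `run`) and an assumed registry shape (name mapped to class). Check `benchmarks/base.py` and `run_benchmark.py` for the actual signatures before copying this.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class BaseBenchmark(ABC):
    """Stand-in for slm_evals.benchmarks.base.BaseBenchmark (interface assumed)."""

    name = "base"

    def __init__(self, max_samples: Optional[int] = None) -> None:
        self.max_samples = max_samples

    @abstractmethod
    def load_samples(self) -> list:
        """Return a list of sample dicts."""

    @abstractmethod
    def score(self, sample: dict, output: str) -> float:
        """Return a score in [0, 1] for one model output."""

    def run(self, generate: Callable[[str], str]) -> dict:
        # Slicing with None keeps every sample; an int caps the run.
        samples = self.load_samples()[: self.max_samples]
        scores = [self.score(s, generate(s["prompt"])) for s in samples]
        accuracy = sum(scores) / len(scores) if scores else 0.0
        return {"benchmark": self.name, "n": len(scores), "accuracy": accuracy}


class MyBench(BaseBenchmark):
    """Step 1: a toy benchmark with two hard-coded samples."""

    name = "my_bench"

    def load_samples(self) -> list:
        return [
            {"prompt": "2 + 2 =", "answer": "4"},
            {"prompt": "Capital of France?", "answer": "Paris"},
        ]

    def score(self, sample: dict, output: str) -> float:
        # Case-insensitive substring match against the reference answer.
        return 1.0 if sample["answer"].lower() in output.lower() else 0.0


# Step 2: hypothetical registry shape, mapping the CLI name to the class.
BENCHMARK_REGISTRY = {"my_bench": MyBench}
```

With a registry of this shape, the name passed to `--benchmarks` in step 3 would be the dictionary key, here `my_bench`.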
## Output formats
Results are written under `<output-dir>/<experiment_name>/`:
- `results.json`: full structured dump
- `results.csv`: one row per benchmark
- `report.md`: human-readable summary
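For post-processing, the three files can be loaded with the standard library alone. The sketch below assumes only the directory layout and file names listed above. The helper name `load_run` is hypothetical, and because the JSON schema and CSV column names are not documented here, the code does not rely on any particular field.

```python
import csv
import json
from pathlib import Path


def load_run(output_dir: str, experiment_name: str) -> dict:
    """Load the three result files of one run (hypothetical helper)."""
    run_dir = Path(output_dir) / experiment_name
    with open(run_dir / "results.json", encoding="utf-8") as f:
        full = json.load(f)
    # One dict per CSV row; keys come from whatever header the file has.
    with open(run_dir / "results.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    report = (run_dir / "report.md").read_text(encoding="utf-8")
    return {"full": full, "rows": rows, "report": report}
```

A call such as `load_run("./results", "experiment_001")` would then return the parsed JSON, the CSV rows, and the Markdown report text, provided a run with that experiment name exists.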
## Notes
**τ-bench user simulator**: The default is a lightweight rule-based simulator. Set `use_llm_user: true` in the config to use the GPT-4o user agent instead, which incurs API cost.
**SWE-bench full eval**: Set `full_eval: true` to run the official Docker harness (`pip install swebench docker`).
**GAIA tools**: Offline by default (`tool_mode: describe`). Wire real tools in `gaia.py` for live eval.