Spaces:

MSGEncrypted
/

lesson-agent-dev

Sleeping

App Files Files Community

lesson-agent-dev / research /evals /README.md

MSG

Merge pull request #4 from MSghais/experiment/small_model_building_testing

59e2c8a 22 days ago

preview code

Raw

History Blame Contribute Delete

3.12 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

SLM Agentic Benchmark Suite

A uv workspace package to evaluate local HuggingFace models against agentic and academic benchmarks.

Docs: USAGE.md (commands and workflows) · docs/benchmarks.md (per-benchmark reference) · ../USAGE.md (full research tree)

Suite	CLI	What it measures
Agentic	`slm-benchmark`	BFCL, τ-bench, GAIA, SWE-bench
Academic	`slm-lm-eval`	ARC, HellaSwag, GSM8K, … (lm-evaluation-harness)

Install

From the repo root:

uv sync --group evals
uv sync --group lm-eval   # optional: slm-lm-eval academic benchmarks

Quickstart

# From repo root (recommended)
uv run --package slm-evals slm-benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# Or as a module
uv run --package slm-evals python -m slm_evals.run_benchmark \
  --model openbmb/MiniCPM5-1B \
  --benchmarks bfcl tau_bench \
  --max-samples 20

# YAML config
uv run --package slm-evals slm-benchmark \
  --config research/evals/configs/experiment_001.yaml

Project structure

research/evals/
├── pyproject.toml
├── configs/
│   └── experiment_001.yaml
├── src/slm_evals/
│   ├── run_benchmark.py
│   ├── benchmarks/
│   │   ├── base.py
│   │   ├── bfcl.py
│   │   ├── tau_bench.py
│   │   ├── gaia.py
│   │   └── swe_bench.py
│   └── utils/
│       ├── model_loader.py
│       ├── reporter.py
│       └── config_loader.py
└── results/              # created at runtime (relative to cwd)

CLI reference

--model          Path to local HF model dir (or Hub ID)
--benchmarks     Space-separated: bfcl tau_bench gaia swe_bench all
--config         YAML config file (overrides CLI flags)
--max-samples    Cap samples per benchmark
--output-dir     Results directory (default: ./results)
--experiment-name  Tag for this run
--device         auto | cpu | cuda | cuda:0
--dtype          float32 | float16 | bfloat16 | int8 | int4
--max-new-tokens Max tokens per generation (default: 512)
--temperature    Sampling temp (default: 0.0 = greedy)

Adding a custom benchmark

Create src/slm_evals/benchmarks/my_bench.py and subclass BaseBenchmark.
Register it in src/slm_evals/run_benchmark.py → BENCHMARK_REGISTRY.
Run: uv run --package slm-evals slm-benchmark --model ./my-model --benchmarks my_bench

Output formats

Results are written under <output-dir>/<experiment_name>/:

results.json — full structured dump
results.csv — one row per benchmark
report.md — human-readable summary

Notes

τ-bench user simulator: Default is a lightweight rule-based simulator. Set use_llm_user: true in config for the GPT-4o user agent (API cost).

SWE-bench full eval: Set full_eval: true to run the official Docker harness (pip install swebench docker).

GAIA tools: Offline by default (tool_mode: describe). Wire real tools in gaia.py for live eval.