# StateShiftBench Code
This repository contains the executable evaluation harness for StateShiftBench and minimal baseline policies for review.
StateShiftBench evaluates dynamic state invalidation: whether a tool-using agent can detect that the state assumptions behind its plan have expired before committing an irreversible action.
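To make the failure mode concrete, here is a minimal, purely illustrative sketch (not code from this repository; `Snapshot`, `read_state`, and `commit` are hypothetical stand-ins): a state-aware agent re-reads the relevant state just before an irreversible action and refuses to commit if its planning-time snapshot has expired.

```python
# Illustrative sketch only -- not part of the StateShiftBench API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """Hypothetical planning-time view of one resource."""
    resource_id: str
    version: int  # bumped whenever the resource changes

def safe_commit(snapshot: Snapshot, read_state, commit) -> bool:
    """Re-validate the snapshot before committing an irreversible action."""
    current = read_state(snapshot.resource_id)
    if current.version != snapshot.version:
        # The state assumptions behind the plan have expired;
        # a state-aware agent re-plans instead of committing.
        return False
    commit()
    return True
```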
## Quick Start
Install requirements:

```bash
pip install -r requirements.txt
pip install -e .
```
Validate cases:

```bash
python scripts/validate_cases.py --data ../stateshiftbench_dataset/data/cases
```
Run a smoke evaluation:

```bash
python scripts/run_eval.py --data ../stateshiftbench_dataset/data/cases --strategy direct --limit 5 --output outputs/direct_smoke.jsonl
python scripts/summarize_results.py --episodes outputs/direct_smoke.jsonl
```
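The episode log is plain JSONL, one record per episode, so it can also be inspected directly. A minimal sketch; the actual record fields are defined in `stateshiftbench/schemas.py`, so this only prints whatever keys each record carries:

```python
import json

# Ad-hoc inspection of the episode log produced above.
with open("outputs/direct_smoke.jsonl") as f:
    for line in f:
        episode = json.loads(line)
        print(sorted(episode.keys()))  # see schemas.py for field definitions
```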
Run the reference state-aware policy:

```bash
python scripts/run_eval.py --data ../stateshiftbench_dataset/data/cases --strategy stact_reference --limit 5 --output outputs/stact_reference_smoke.jsonl
python scripts/summarize_results.py --episodes outputs/stact_reference_smoke.jsonl
```
## Strategies
- `direct`: follows the direct-commit reference trace, representing an agent that acts on the first snapshot.
- `stact_reference`: follows the State-Track-Validate-Act reference trace when available, representing state-aware behavior.
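A hedged sketch of the shape such a strategy might take, assuming a simple observation-to-action interface; every name below is hypothetical, and the real contract lives in `stateshiftbench/runner.py` and `baselines/reference.py`:

```python
# Hypothetical interface for illustration -- not the repository's actual API.
from __future__ import annotations
from typing import Any, Protocol

class Strategy(Protocol):
    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        """Map the current environment observation to the next tool action."""
        ...

class DirectStrategy:
    """`direct`-style baseline: replay a stored trace without re-checking state."""

    def __init__(self, trace: list[dict[str, Any]]):
        self._steps = iter(trace)

    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        # Ignores the observation entirely: acts on the first snapshot.
        return next(self._steps)
```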
The submitted paper evaluates learned language-model policies in addition to these reference strategies. This release provides the benchmark runner, schemas, and metric computation needed to inspect the executable artifact.
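As a hedged example of the kind of metric such a harness computes (illustrative only; the actual definitions live in `stateshiftbench/metrics.py`, and the `committed` and `state_valid_at_commit` fields below are hypothetical), one natural summary is the fraction of episodes that committed an irreversible action on stale state:

```python
import json

def unsafe_commit_rate(path: str) -> float:
    """Fraction of episodes that committed irreversibly on expired state."""
    with open(path) as f:
        episodes = [json.loads(line) for line in f]
    unsafe = sum(
        1 for e in episodes
        if e.get("committed") and not e.get("state_valid_at_commit", True)
    )
    return unsafe / max(len(episodes), 1)
```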
## Repository Structure
```
stateshiftbench/
    environment.py
    metrics.py              # metric computation
    runner.py               # benchmark runner
    schemas.py              # case and episode record schemas
    baselines/
        reference.py        # direct and stact_reference trace-following policies
scripts/
    validate_cases.py       # case validation
    run_eval.py             # evaluation entry point
    summarize_results.py    # metric summaries over episode logs
tests/
    test_runner_smoke.py    # end-to-end smoke test
```
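To check the harness end to end, the included smoke test can be run directly (assuming pytest is installed; it is not listed in the commands above):

```bash
python -m pytest tests/test_runner_smoke.py
```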
## Anonymity
This review release is anonymized and does not include author names, institutions, private server paths, credentials, or access tokens.