# StateShiftBench Code
This repository contains the executable evaluation harness for StateShiftBench and minimal baseline policies for review.
StateShiftBench evaluates dynamic state invalidation: whether a tool-using agent can detect that the state assumptions behind its plan have expired before committing an irreversible action.
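To make the failure mode concrete, here is a minimal, purely illustrative sketch (not code from this repository; `Snapshot`, `read_state`, and `commit` are hypothetical stand-ins): a state-aware agent re-reads the relevant state just before an irreversible action and refuses to commit if its planning-time snapshot has expired.

```python
# Illustrative sketch only -- not part of the StateShiftBench API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """Hypothetical planning-time view of one resource."""
    resource_id: str
    version: int  # bumped whenever the resource changes

def safe_commit(snapshot: Snapshot, read_state, commit) -> bool:
    """Re-validate the snapshot before committing an irreversible action."""
    current = read_state(snapshot.resource_id)
    if current.version != snapshot.version:
        # The state assumptions behind the plan have expired;
        # a state-aware agent re-plans instead of committing.
        return False
    commit()
    return True
```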
## Quick Start
Install requirements:

```bash
pip install -r requirements.txt
pip install -e .
```
Validate cases:

```bash
python scripts/validate_cases.py --data ../stateshiftbench_dataset/data/cases
```
Run a smoke evaluation:

```bash
python scripts/run_eval.py --data ../stateshiftbench_dataset/data/cases --strategy direct --limit 5 --output outputs/direct_smoke.jsonl
python scripts/summarize_results.py --episodes outputs/direct_smoke.jsonl
```
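The episode log is plain JSONL, one record per episode, so it can also be inspected directly. A minimal sketch; the actual record fields are defined in `stateshiftbench/schemas.py`, so this only prints whatever keys each record carries:

```python
import json

# Ad-hoc inspection of the episode log produced above.
with open("outputs/direct_smoke.jsonl") as f:
    for line in f:
        episode = json.loads(line)
        print(sorted(episode.keys()))  # see schemas.py for field definitions
```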
Run the reference state-aware policy:

```bash
python scripts/run_eval.py --data ../stateshiftbench_dataset/data/cases --strategy stact_reference --limit 5 --output outputs/stact_reference_smoke.jsonl
python scripts/summarize_results.py --episodes outputs/stact_reference_smoke.jsonl
```
## Strategies
- `direct`: follows the direct-commit reference trace, representing an agent that acts on the first snapshot.
- `stact_reference`: follows the State-Track-Validate-Act reference trace when available, representing state-aware behavior.
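A hedged sketch of the shape such a strategy might take, assuming a simple observation-to-action interface; every name below is hypothetical, and the real contract lives in `stateshiftbench/runner.py` and `baselines/reference.py`:

```python
# Hypothetical interface for illustration -- not the repository's actual API.
from __future__ import annotations
from typing import Any, Protocol

class Strategy(Protocol):
    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        """Map the current environment observation to the next tool action."""
        ...

class DirectStrategy:
    """`direct`-style baseline: replay a stored trace without re-checking state."""

    def __init__(self, trace: list[dict[str, Any]]):
        self._steps = iter(trace)

    def act(self, observation: dict[str, Any]) -> dict[str, Any]:
        # Ignores the observation entirely: acts on the first snapshot.
        return next(self._steps)
```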
The submitted paper evaluates learned language-model policies in addition to these reference strategies. This release provides the benchmark runner, schemas, and metric computation needed to inspect the executable artifact.
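As a hedged example of the kind of metric such a harness computes (illustrative only; the actual definitions live in `stateshiftbench/metrics.py`, and the `committed` and `state_valid_at_commit` fields below are hypothetical), one natural summary is the fraction of episodes that committed an irreversible action on stale state:

```python
import json

def unsafe_commit_rate(path: str) -> float:
    """Fraction of episodes that committed irreversibly on expired state."""
    with open(path) as f:
        episodes = [json.loads(line) for line in f]
    unsafe = sum(
        1 for e in episodes
        if e.get("committed") and not e.get("state_valid_at_commit", True)
    )
    return unsafe / max(len(episodes), 1)
```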
## Repository Structure
```
stateshiftbench/
    environment.py
    metrics.py              # metric computation
    runner.py               # benchmark runner
    schemas.py              # case and episode record schemas
    baselines/
        reference.py        # direct and stact_reference trace-following policies
scripts/
    validate_cases.py       # case validation
    run_eval.py             # evaluation entry point
    summarize_results.py    # metric summaries over episode logs
tests/
    test_runner_smoke.py    # end-to-end smoke test
```
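To check the harness end to end, the included smoke test can be run directly (assuming pytest is installed; it is not listed in the commands above):

```bash
python -m pytest tests/test_runner_smoke.py
```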
## Anonymity
This review release is anonymized and does not include author names, institutions, private server paths, credentials, or access tokens.