Benchmark Controller FSM (Strict-safe M3 Sweep)

Component and scope

Component: benchmark orchestration path in training/kan_benchmark_suite.py that drives matrix runs, applies toolchain gates, and writes per-cell telemetry.

External dependencies: command-line arguments, toolchain manifest cache, optional training subprocess, filesystem outputs, and user cancellation.

State list (mutually exclusive)

  1. IDLE
    No run initialized. Invariants: no active case, no snapshot timer.
  2. PRE_FLIGHT
    Manifest/cached toolchain status loaded. Invariants: manifest snapshot hash set.
  3. GATE_WAIT
    Running toolchain gate checks for a case. Invariants: pending case id is current.
  4. RUN_READY
    Gate checks passed, or failed only in warn (non-strict) mode, and base args are frozen for the run. Invariants: kernel_profile, runtime_backend_plan, and sweep params are consistent.
  5. RUNNING
    One benchmark cell is executing. Invariants: run_id and seed are assigned; history stream is hot.
  6. RUN_COMPLETED
    Current cell finished and result is buffered. Invariants: final metrics exist or failure row written.
  7. TEARDOWN
    Persisting run row and cleaning per-run artifacts. Invariants: output files open.
  8. ERROR
    Hard gate or runtime failure; may still emit failure row in non-blocking cases.
  9. CANCELLED
    User-initiated cancel/unmount; active run is aborted and best-effort persisted.
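
One way to keep the states mutually exclusive in code is to hold them in a single enum-valued field. The sketch below is illustrative only: the `State` class name is an assumption, and nothing here is taken from training/kan_benchmark_suite.py.

```python
from enum import Enum, auto


class State(Enum):
    """Controller states from the list above (class name is hypothetical)."""

    IDLE = auto()           # no active case, no snapshot timer
    PRE_FLIGHT = auto()     # manifest snapshot hash set
    GATE_WAIT = auto()      # pending case id is current
    RUN_READY = auto()      # base args frozen
    RUNNING = auto()        # run_id and seed assigned
    RUN_COMPLETED = auto()  # final metrics or failure row buffered
    TEARDOWN = auto()       # persisting run row
    ERROR = auto()          # hard gate or runtime failure
    CANCELLED = auto()      # user-initiated cancel/unmount


# A controller that stores exactly one State value cannot be in two states at once.
current = State.IDLE
```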

Events

  • E_INIT
  • E_CONFIG_PARSED
  • E_PRE_FLIGHT_OK
  • E_PRE_FLIGHT_FAIL
  • E_GATE_CHECK_OK
  • E_GATE_CHECK_FAIL_WARN
  • E_GATE_CHECK_FAIL_ERROR
  • E_CASE_START
  • E_STEP_DONE
  • E_RUN_OK
  • E_RUN_FAIL
  • E_RETRY
  • E_CANCEL
  • E_TIMEOUT
  • E_UNMOUNT
  • E_STALE_EVENT(older_run_id)
  • E_NEW_INPUTS

Guards

  • G_strict_mode: strict CoreML mode is active.
  • G_gate_requires_coreml: case path requires strict CoreML visibility.
  • G_case_requires_ane: current case runtime plan is ANE/HYBRID.
  • G_retry_budget: remaining retries > 0.
  • G_fresh: event run id matches current run_id.
  • G_output_ok: output directory writable.
  • G_cancel_requested: cancellation flag set.
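
The guards can be read as pure predicates over a small controller context. The following is a minimal sketch under stated assumptions: the `Ctx` dataclass, its field names, and the function names are hypothetical and do not mirror the actual suite code.

```python
import os
from dataclasses import dataclass


@dataclass
class Ctx:
    """Hypothetical controller context; field names are assumptions."""

    strict_mode: bool = False
    gate_requires_coreml: bool = False
    runtime_plan: str = "CPU"
    retries_left: int = 0
    run_id: int = 0
    output_dir: str = "."
    cancel_requested: bool = False


def g_strict_mode(ctx: Ctx) -> bool:
    return ctx.strict_mode


def g_case_requires_ane(ctx: Ctx) -> bool:
    # True when the current case's runtime plan is ANE or HYBRID.
    return ctx.runtime_plan in {"ANE", "HYBRID"}


def g_retry_budget(ctx: Ctx) -> bool:
    return ctx.retries_left > 0


def g_fresh(ctx: Ctx, event_run_id: int) -> bool:
    # An event is fresh only if it carries the current run_id.
    return event_run_id == ctx.run_id


def g_output_ok(ctx: Ctx) -> bool:
    # The output directory must exist and be writable.
    return os.path.isdir(ctx.output_dir) and os.access(ctx.output_dir, os.W_OK)
```

Keeping guards free of side effects means the same predicates can be reused by the transition logic and by tests.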

Side effects

  • Build environment manifest + cache lookup (_collect_toolchain_manifest).
  • Evaluate gate (_evaluate_toolchain_gate).
  • Create per-case output directory.
  • Instantiate training args (_set_args_from_base) and invoke run_training.
  • Write per-cell run JSON + summary CSV.
  • Emit console warning/error lines.
  • On cancel/unmount: clear in-flight worker handles and skip remaining scheduled cases.
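
As an illustration of the "per-cell run JSON + summary CSV" side effect, here is a small standard-library sketch. The `persist_cell` helper, the `case_<run_id>` directory layout, and the file names `run.json` and `summary.csv` are assumptions made for the example, not the names used by the suite.

```python
import csv
import json
import tempfile
from pathlib import Path


def persist_cell(out_dir: Path, row: dict) -> Path:
    """Write one run row as JSON and append it to a shared summary CSV."""
    case_dir = out_dir / f"case_{row['run_id']}"  # hypothetical layout
    case_dir.mkdir(parents=True, exist_ok=True)

    json_path = case_dir / "run.json"
    json_path.write_text(json.dumps(row, indent=2, sort_keys=True))

    csv_path = out_dir / "summary.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=sorted(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return json_path


if __name__ == "__main__":
    out = Path(tempfile.mkdtemp())
    persist_cell(out, {"run_id": 1, "seed": 7, "toolchain_gate_ok": True})
    print(sorted(p.name for p in out.iterdir()))
```

The sketch assumes every row carries the same keys; a real summary writer would need a fixed column list.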

Transition table

| state | event | guard | next state | actions |
| --- | --- | --- | --- | --- |
| IDLE | E_CONFIG_PARSED | G_output_ok | PRE_FLIGHT | capture manifest and persist suite manifest |
| IDLE | E_CONFIG_PARSED | ~G_output_ok | ERROR | fail fast, emit manifest I/O error |
| PRE_FLIGHT | E_PRE_FLIGHT_OK | True | GATE_WAIT | compute suite defaults and base args |
| PRE_FLIGHT | E_PRE_FLIGHT_FAIL | True | ERROR | add gate diagnostics row, continue if warn |
| GATE_WAIT | E_GATE_CHECK_OK | ~G_gate_requires_coreml OR ~G_strict_mode | RUN_READY | record toolchain_gate_issues (empty) |
| GATE_WAIT | E_GATE_CHECK_FAIL_WARN | ~G_strict_mode | RUN_READY | record issues; mark warning metadata |
| GATE_WAIT | E_GATE_CHECK_FAIL_ERROR | G_strict_mode | ERROR | throw/fail row with coreml reason |
| RUN_READY | E_CASE_START | G_fresh AND ~G_cancel_requested | RUNNING | set run_id, seed, case overrides |
| RUN_READY | E_NEW_INPUTS | G_fresh | RUN_READY | update next-case policy and rebuild base args |
| RUNNING | E_STEP_DONE | G_fresh | RUNNING | append telemetry step from history stream |
| RUNNING | E_RUN_OK | G_fresh | RUN_COMPLETED | finalize metrics and compute row-level ratios |
| RUNNING | E_RUN_FAIL | G_fresh | RUN_COMPLETED | persist failure row with toolchain_gate_ok=False |
| RUNNING | E_RUN_FAIL | ~G_fresh | RUNNING | drop stale result, retain active run |
| RUNNING | E_TIMEOUT | G_retry_budget | ERROR | cancel/retry with backoff policy |
| RUN_COMPLETED | E_CASE_START | G_fresh | TEARDOWN | collect manifest + append run_result |
| RUN_COMPLETED | E_CANCEL | ~G_cancel_requested | TEARDOWN | mark incomplete row and break loops |
| TEARDOWN | E_STEP_DONE | True | TEARDOWN | continue writing CSV artifact updates |
| TEARDOWN | E_RUN_OK | cases remain | GATE_WAIT | schedule next case |
| TEARDOWN | E_RUN_OK | no cases remain | IDLE | emit final report paths |
| ERROR | E_RETRY | G_retry_budget | GATE_WAIT | re-run last case with updated seed/backoff |
| ERROR | E_CANCEL | True | CANCELLED | stop scheduling, persist partial report |
| CANCELLED | E_UNMOUNT | True | IDLE | flush pending writes, close handles |
| any | E_CANCEL | G_cancel_requested | CANCELLED | set abort flag and stop future case launches |
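
A table like this maps naturally onto a table-driven dispatcher. The sketch below encodes only a subset of the rows and is not the suite's implementation. Its assumptions: states and events are plain strings, guards read boolean flags from a dict, and an event with no matching row leaves the state unchanged.

```python
from typing import Callable, Dict, List, Tuple

Ctx = Dict[str, bool]
Guard = Callable[[Ctx], bool]


def always(ctx: Ctx) -> bool:
    return True


def flag(name: str, want: bool = True) -> Guard:
    """Build a guard that compares one boolean context flag to `want`."""
    return lambda ctx: bool(ctx.get(name, False)) == want


# (state, event) -> ordered list of (guard, next_state); subset of the table.
TABLE: Dict[Tuple[str, str], List[Tuple[Guard, str]]] = {
    ("IDLE", "E_CONFIG_PARSED"): [
        (flag("output_ok"), "PRE_FLIGHT"),
        (flag("output_ok", False), "ERROR"),
    ],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_OK"): [(always, "GATE_WAIT")],
    ("PRE_FLIGHT", "E_PRE_FLIGHT_FAIL"): [(always, "ERROR")],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_WARN"): [(flag("strict_mode", False), "RUN_READY")],
    ("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR"): [(flag("strict_mode"), "ERROR")],
    ("ERROR", "E_RETRY"): [(flag("retry_budget"), "GATE_WAIT")],
    ("ERROR", "E_CANCEL"): [(always, "CANCELLED")],
    ("CANCELLED", "E_UNMOUNT"): [(always, "IDLE")],
}

# Rows whose state column is "any".
ANY_STATE: Dict[str, List[Tuple[Guard, str]]] = {
    "E_CANCEL": [(flag("cancel_requested"), "CANCELLED")],
}


def step(state: str, event: str, ctx: Ctx) -> str:
    """Return the next state, or the current state if no row matches."""
    rows = TABLE.get((state, event), []) + ANY_STATE.get(event, [])
    for guard, next_state in rows:
        if guard(ctx):
            return next_state
    return state
```

For example, `step("GATE_WAIT", "E_GATE_CHECK_FAIL_ERROR", {"strict_mode": True})` returns `"ERROR"`, matching the strict-mode row.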

Mermaid

stateDiagram-v2
    [*] --> IDLE
    IDLE --> PRE_FLIGHT : E_CONFIG_PARSED / capture_manifest
    IDLE --> ERROR : E_CONFIG_PARSED / output not writable
    PRE_FLIGHT --> GATE_WAIT : E_PRE_FLIGHT_OK
    PRE_FLIGHT --> ERROR : E_PRE_FLIGHT_FAIL
    GATE_WAIT --> RUN_READY : E_GATE_CHECK_OK
    GATE_WAIT --> RUN_READY : E_GATE_CHECK_FAIL_WARN
    GATE_WAIT --> ERROR : E_GATE_CHECK_FAIL_ERROR
    RUN_READY --> RUNNING : E_CASE_START
    RUNNING --> RUNNING : E_STEP_DONE
    RUNNING --> RUN_COMPLETED : E_RUN_OK
    RUNNING --> RUN_COMPLETED : E_RUN_FAIL
    RUNNING --> ERROR : E_TIMEOUT
    RUN_COMPLETED --> TEARDOWN : E_CASE_START
    TEARDOWN --> GATE_WAIT : next_case
    TEARDOWN --> IDLE : all_cases_done
    ERROR --> GATE_WAIT : E_RETRY
    ERROR --> CANCELLED : E_CANCEL
    CANCELLED --> IDLE : E_UNMOUNT
    RUN_READY --> CANCELLED : E_CANCEL
    RUNNING --> CANCELLED : E_CANCEL
    RUN_COMPLETED --> CANCELLED : E_CANCEL
    IDLE --> [*] : process_end

Race and stale-event handling

  • Older in-flight run events are ignored using run_id guard (G_fresh).
  • If E_NEW_INPUTS arrives while RUNNING, latest override is accepted only after current run enters TEARDOWN.
  • E_CANCEL always has priority over E_STEP_DONE and transitions directly to CANCELLED.
  • On unmount, only the latest active run ID is allowed to persist output; stale completions are dropped.
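
The first and third rules can be sketched as a filter-then-prioritize pass over pending events. This is a hedged illustration: the `Event` dataclass and `order_events` helper are hypothetical, and it assumes control events such as E_CANCEL carry no run_id.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    name: str
    run_id: Optional[int] = None  # None for control events (assumption)


def order_events(pending: List[Event], current_run_id: int) -> List[Event]:
    """Drop stale events, then move E_CANCEL ahead of everything else."""
    fresh = [
        e for e in pending
        if e.run_id is None or e.run_id == current_run_id  # G_fresh
    ]
    # sorted() is stable, so non-cancel events keep their arrival order.
    return sorted(fresh, key=lambda e: e.name != "E_CANCEL")
```

With run 2 active, a queue of `E_STEP_DONE(2)`, `E_RUN_OK(1)`, `E_CANCEL` is reduced to `E_CANCEL`, `E_STEP_DONE(2)`: the result from run 1 is dropped and the cancel is handled first.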

Edge-coverage tests

  1. Start in IDLE; strict gate missing-coreml in error mode => GATE_WAIT -> ERROR.
  2. Warn mode missing-coreml => GATE_WAIT -> RUN_READY with warning metadata.
  3. RUNNING stale result while next run started => stale event dropped, active run continues.
  4. E_CANCEL during RUNNING => no additional case launches after current step.
  5. Retry path from ERROR executes when retry budget remains and clears last failed case cache.
  6. toolchain_gate_coreml_issue populated only when gate failure string contains coreml keywords.
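
Cases 1, 2, and 6 could be written as plain assertion-style tests along the following lines. The helpers `gate_next_state` and `coreml_issue` and the `COREML_KEYWORDS` tuple are stand-ins invented for this sketch. The real gate logic lives in _evaluate_toolchain_gate, whose signature and keyword list are not shown in this document.

```python
from typing import Optional

COREML_KEYWORDS = ("coreml",)  # hypothetical keyword list


def gate_next_state(event: str, strict_mode: bool) -> str:
    """Stand-in for the two GATE_WAIT failure rows of the transition table."""
    if event == "E_GATE_CHECK_FAIL_ERROR" and strict_mode:
        return "ERROR"
    if event == "E_GATE_CHECK_FAIL_WARN" and not strict_mode:
        return "RUN_READY"
    return "GATE_WAIT"


def coreml_issue(gate_failure: str) -> Optional[str]:
    """Return the failure string only if it mentions a CoreML keyword."""
    text = gate_failure.lower()
    return gate_failure if any(k in text for k in COREML_KEYWORDS) else None


def test_strict_missing_coreml_goes_to_error():
    assert gate_next_state("E_GATE_CHECK_FAIL_ERROR", strict_mode=True) == "ERROR"


def test_warn_missing_coreml_goes_to_run_ready():
    assert gate_next_state("E_GATE_CHECK_FAIL_WARN", strict_mode=False) == "RUN_READY"


def test_coreml_issue_only_on_keyword():
    assert coreml_issue("CoreML tools not visible") == "CoreML tools not visible"
    assert coreml_issue("output dir not writable") is None


if __name__ == "__main__":
    test_strict_missing_coreml_goes_to_error()
    test_warn_missing_coreml_goes_to_run_ready()
    test_coreml_issue_only_on_keyword()
```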