Policy Load Balancer: Risk Modes, Degradation, and Kill-Switches
Draft v0.1 — Non-normative supplement to SI-Core / SI-NOS / PoLB / EVAL / ETH specs
Terminology note (art-60 series):
- PLB is reserved for Pattern-Learning-Bridge (art-60-016).
- This document uses PoLB for Policy Load Balancer to avoid acronym collisions.
This document is non-normative. It describes how to think about and operate the Policy Load Balancer (PoLB) in a Structured Intelligence stack. Normative contracts live in the PoLB / EVAL / ETH interface specs and SI-Core / SI-NOS design docs.
0. Conventions used in this draft (non-normative)
This draft follows the portability conventions used in 069/084+ when an artifact might be exported, hashed, or attested (PoLB mode catalogs, PoLB policies, transition logs, kill-switch contracts, postmortem bundles):
- `created_at` is operational time (advisory unless time is attested).
- `as_of` carries markers only (time claim + optional revocation view markers) and SHOULD declare `clock_profile: "si/clock-profile/utc/v1"` when exported.
- `trust` carries digests only (trust anchors + optional revocation view digests). Never mix markers into `trust`.
- `bindings` pins meaning as `{id, digest}` (meaningful identities must not be digest-only).
- Avoid floats in policy-/digest-bound artifacts: prefer scaled integers (`*_bp`, `*_ppm`) and integer micro/milliseconds.
- If you hash/attest legal/procedural artifacts, declare canonicalization explicitly: `canonicalization: "si/jcs-strict/v1"` and `canonicalization_profile_digest: "sha256:..."`.
- `digest_rule` strings (when present) are explanatory only; verifiers MUST compute digests using pinned schemas/profiles, not by parsing `digest_rule`.
Numeric conventions used in examples:
- For weights and ratios in [0,1], export as basis points: `x_bp = round(x * 10000)`.
- For probabilities in [0,1], export as basis points: `p_bp = round(p * 10000)`.
- For very small probabilities, ppm is acceptable: `p_ppm = round(p * 1_000_000)`.
- If a clause uses a legal threshold with decimals, either:
  - export `value_scaled_int` + explicit `scale`, or
  - keep the raw numeric payload in a referenced artifact and export only its digest (`*_ref` / `*_hash`) plus display-only summaries.
Internal computation may still use floats; the convention here is about exported/hashed representations.
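As a worked example of the export convention, here is a minimal helper sketch; the names `to_bp` and `to_ppm` are illustrative, not part of any spec:

```python
# Non-normative sketch: exporting floats in [0,1] as scaled integers.

def to_bp(x: float) -> int:
    """Export a ratio/weight/probability in [0,1] as basis points (0..10000)."""
    if not (0.0 <= x <= 1.0):
        raise ValueError("expected a value in [0,1]")
    return round(x * 10_000)

def to_ppm(p: float) -> int:
    """Export a very small probability in [0,1] as parts per million."""
    if not (0.0 <= p <= 1.0):
        raise ValueError("expected a value in [0,1]")
    return round(p * 1_000_000)
```

For example, `to_bp(0.25)` yields `2500` and `to_ppm(0.000123)` yields `123`, matching the `*_bp` / `*_ppm` fields used in the policy examples below.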
1. Why a Policy Load Balancer exists
Once you have:
- Jumps as the unit of “intelligent move,”
- RML as the glue for external effects,
- EVAL as first-class evaluation,
- ETH as an overlay for safety and rights,
…you still face a practical question:
In which mode is the system allowed to act right now? How much risk can it take, for whom, and under which governance?
Real systems have very different operating conditions:
- “Lab sandbox at 2am with fake data”
- “Shadow-mode on production traffic, no real effects”
- “Normal online operation with small experiments”
- “Emergency: flood wall is about to be overtopped”
The Policy Load Balancer (PoLB) is the component that:
takes a risk profile (domain, current conditions, ETH constraints, EVAL results),
and selects an operational mode:
- what kinds of Jumps are allowed,
- which RML levels are permitted,
- what kind of experiments may run,
- what kind of “genius replay” or special controllers are enabled or disabled,
- under which degradation and kill-switch rules.
Think of PoLB as:
“The goal-surface for how allowed the rest of the system is to act.”
2. Mental model: PoLB as a goal surface over modes
A simple way to think about PoLB:
Inputs:
- domain & principal(s),
- risk_signals (telemetry, alerts, incidents, load),
- risk_score_bp (0..10000) derived from `risk_signals` using a domain-specific rubric,
- risk_level (low / medium / high / emergency) derived from `risk_score_bp` using band thresholds,
- ETH & EVAL status (incidents, experiments, regressions),
- ID / role context (who is delegating what to whom).
Terminology note:
- risk_band is a categorical operating class (OFFLINE / SHADOW / ONLINE_SAFE / ONLINE_EXPERIMENTAL / EMERGENCY).
- risk_score_bp is a scalar severity score (basis points) used to decide transitions within/between bands.
- risk_level is a human-facing tier label derived from `risk_score_bp` (not a float).
Internal:
- a mode policy: what modes exist and what they permit,
- a transition policy: how and when we move between modes,
- a degradation & kill-switch policy: what to shut off, in which order (risk vs capacity).
Output:
- an active mode descriptor used by:
- Jump runtime,
- RML execution,
- EVAL / experiment systems,
- PoLB-aware UIs and operators.
Very roughly:
Context (risk_signals, ETH, EVAL, ID/role)
→ PoLB
→ decision: { envelope_mode, mode_name, risk_band, risk_score_bp, risk_level }
→ allowed: { jump_types, rml_levels, experiments, engines }
→ required: { human_approval, logging, ETH-hard-blocks, etc. }
Where:
- envelope_mode is the SI-Core conformance posture carried in SIR metadata: sandbox|shadow|online|degraded
- mode_name / risk_band are the finer-grained PoLB profile (operator- and analytics-facing)
We’ll treat modes as first-class structures, not booleans sprinkled across services.
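The `risk_score_bp` → `risk_level` derivation can be a plain band lookup. A minimal sketch; the thresholds below are illustrative only, not normative:

```python
# Non-normative sketch: derive a human-facing risk_level tier from a
# scalar risk_score_bp (0..10000). Band thresholds are illustrative.

RISK_BANDS_BP = [
    (2_500, "low"),        # 0..2500
    (6_000, "medium"),     # 2501..6000
    (9_000, "high"),       # 6001..9000
    (10_000, "emergency"), # 9001..10000
]

def risk_level_from_score(risk_score_bp: int) -> str:
    score = max(0, min(10_000, int(risk_score_bp)))  # clamp to valid range
    for upper_bp, level in RISK_BANDS_BP:
        if score <= upper_bp:
            return level
    return "emergency"  # unreachable given the table above
```

Keeping the band table in exported policy (rather than code) makes the tier boundaries auditable alongside the rest of the PoLB configuration.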
3. Mode taxonomy: risk bands and capabilities
This is a non-normative but useful baseline taxonomy.
3.1 Core risk bands
risk_bands:
OFFLINE:
description: "No live user impact. Synthetic or archived data only."
SHADOW:
description: "Live inputs, no effectful RML. Observational only."
ONLINE_SAFE:
description: "Normal operation with conservative guardrails."
ONLINE_EXPERIMENTAL:
description: "Online with EVAL-governed experiments."
EMERGENCY:
description: "Time-critical emergency actions with special ETH/PoLB rules."
Each band is realized as one or more PoLB modes.
Series-wide consistency note (mapping): SI-Core conformance and most SIR examples carry a coarse polb.mode envelope (sandbox|shadow|online|degraded). This document’s risk_band / mode_name are finer-grained.
| envelope_mode | typical risk_band / mode_name prefix |
|---|---|
| sandbox | OFFLINE / OFFLINE_* |
| shadow | SHADOW / SHADOW_* |
| online | ONLINE_SAFE, ONLINE_EXPERIMENTAL / ONLINE_* |
| degraded | EMERGENCY / EMERGENCY_* (fail-safe / restricted ops profiles) |
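The mapping table above reduces to a trivial lookup; this sketch assumes the five baseline risk bands from §3.1:

```python
# Non-normative sketch of the mapping table: fine-grained risk_band →
# coarse SI-Core envelope_mode carried in SIR metadata.

ENVELOPE_BY_RISK_BAND = {
    "OFFLINE": "sandbox",
    "SHADOW": "shadow",
    "ONLINE_SAFE": "online",
    "ONLINE_EXPERIMENTAL": "online",
    "EMERGENCY": "degraded",
}

def envelope_mode_for(risk_band: str) -> str:
    try:
        return ENVELOPE_BY_RISK_BAND[risk_band]
    except KeyError:
        # Unknown bands should fail loudly, not default to something permissive.
        raise ValueError(f"unknown risk_band: {risk_band}")
```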
3.2 Example PoLB modes (non-normative schema)
polb_modes:
OFFLINE_SANDBOX:
risk_band: OFFLINE
envelope_mode: "sandbox"
description: "Pure simulation, no external effects."
allowed:
jump_types: ["pure", "simulated_effectful"]
rml_levels: [] # none
experiments: "unrestricted_offline"
genius_replay: "allowed_in_sandbox"
requirements:
human_in_loop: false
eth_overlay: "record_only"
eval_required_for_changes: false
SHADOW_PROD:
risk_band: SHADOW
envelope_mode: "shadow"
description: "Live observations, but RML replaced with no-ops."
allowed:
jump_types: ["pure", "shadow_simulation"]
rml_levels: [] # no *real* effects
experiments: "shadow_only"
genius_replay: "shadow_only"
requirements:
human_in_loop: false
eth_overlay: "enforced"
eval_required_for_changes: true
logging_level: "full"
ONLINE_SAFE_BASELINE:
risk_band: ONLINE_SAFE
envelope_mode: "online"
description: "Normal operations with strict ETH and no risky experiments."
allowed:
jump_types: ["pure", "effectful"]
rml_levels: [1, 2] # confined, compensable operations
experiments: "disabled"
genius_replay: "conservative_only"
requirements:
eth_overlay: "strict"
eval_required_for_changes: true
kill_switch_enabled: true
role_limitations: ["no_unvetted_agents"]
ONLINE_EXPERIMENTAL_STRATIFIED:
risk_band: ONLINE_EXPERIMENTAL
envelope_mode: "online"
description: "EVAL-governed experiments with stratified traffic."
allowed:
jump_types: ["pure", "effectful"]
rml_levels: [1, 2, 3] # with experimental caps per domain
experiments: "eval_governed"
genius_replay: "whitelisted_patterns"
requirements:
eth_overlay: "strict"
eval_required_for_changes: true
polb_experiment_budget:
# Exported form avoids floats: shares in [0,1] use basis points (bp).
max_population_share_bp: 2500 # 0.25
vulnerable_pop_share_max_bp: 500 # 0.05
EMERGENCY_FAILSAFE:
risk_band: EMERGENCY
envelope_mode: "degraded"
description: "Emergency mode with minimal but decisive capabilities."
allowed:
jump_types: ["effectful"]
rml_levels: [2] # only strongly compensable
experiments: "disabled"
genius_replay: "prevalidated_emergency_traces_only"
requirements:
eth_overlay: "emergency_profile"
human_in_loop: true
special_audit: "post_hoc_full_review"
kill_switch_enabled: true
The actual set of modes is domain-specific, but this style of explicit policy is the key.
4. PoLB configuration as structured policy
You can think of PoLB configuration as a YAML-backed policy store that SI-Core consults on each Jump / RML request.
4.1 Example PoLB policy spec (non-normative)
polb_policy:
id: "polb-policy:cityos.flood_control/2028-04@v1.2.0"
created_at: "2028-04-01T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-01T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
# Pin referenced contracts/policies by {id,digest} when exporting/hashing.
bindings:
trust_anchor_set:
id: "si/trust-anchors/example/v1"
digest: "sha256:..."
kill_switch_contract:
id: "kill-switch-contract:cityos.flood_control/KS-CITYOS-2028-01"
digest: "sha256:..."
domain: "cityos.flood_control"
version: "v1.2.0"
default_mode: "ONLINE_SAFE_BASELINE"
modes:
# See 3.2 for definitions (mode catalog lives in this document for illustration).
- name: "OFFLINE_SANDBOX"
...
- name: "SHADOW_PROD"
...
- name: "ONLINE_SAFE_BASELINE"
...
- name: "ONLINE_EXPERIMENTAL_STRATIFIED"
...
- name: "EMERGENCY_FAILSAFE"
...
transitions:
# Automatic transitions driven by telemetry & alarms.
# Exported form avoids floats: use *_bp for ratios in [0,1].
- from: "ONLINE_SAFE_BASELINE"
to: "EMERGENCY_FAILSAFE"
trigger:
type: "alarm"
condition: "flood_risk_score_bp > 9000 OR scada_link_loss == true"
min_duration_sec: 30
required_approvals: ["city_incident_commander"]
- from: "ONLINE_EXPERIMENTAL_STRATIFIED"
to: "ONLINE_SAFE_BASELINE"
trigger:
type: "eth_violation"
condition: "eth_incidents_last_10min > 0"
min_duration_sec: 0
automatic: true
# Kill-switch transitions must reference an explicit contract (see §7.1).
# When exported/hashed, the contract identity SHOULD be pinned in bindings{}.
- from: "*"
to: "SHADOW_PROD"
trigger:
type: "kill_switch"
contract_ref: "kill-switch-contract:cityos.flood_control/KS-CITYOS-2028-01"
reason_required: true
logging: "priority_audit"
degradation_orders:
# When we need to reduce *risk* (prefer less effectful operation)
risk_degrade_order:
- "ONLINE_EXPERIMENTAL_STRATIFIED"
- "ONLINE_SAFE_BASELINE"
- "SHADOW_PROD"
- "OFFLINE_SANDBOX"
# When we need to shed *capacity* (prefer lower compute / optional workloads)
# (Domain-specific; shown as an example knob, not a universal rule.)
capacity_shed_order:
- "OFFLINE_SANDBOX"
- "ONLINE_EXPERIMENTAL_STRATIFIED"
- "ONLINE_SAFE_BASELINE"
- "SHADOW_PROD"
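To make the `transitions` block concrete, here is a deliberately simplified matching sketch: it handles only the `from` wildcard and the trigger type, and omits `condition` evaluation, `min_duration_sec`, and `required_approvals`, all of which real implementations must check:

```python
# Non-normative sketch: select an applicable transition rule from a
# PoLB policy's `transitions` list (dicts mirroring the YAML above).
from typing import Mapping, Optional, Sequence

def find_transition(
    transitions: Sequence[Mapping],
    current_mode: str,
    fired_trigger_type: str,
) -> Optional[Mapping]:
    for rule in transitions:
        from_ok = rule["from"] in ("*", current_mode)  # "*" matches any mode
        if from_ok and rule["trigger"]["type"] == fired_trigger_type:
            return rule
    return None
```

Because kill-switch rules use `from: "*"`, they match from any mode; ordinary rules only fire from their declared source mode.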
5. PoLB in the SI-Core loop
5.1 Runtime view
At Jump runtime, the sequence is:
[OBS] assembles an observation bundle.
[ID] and Role & Persona overlay determine who is acting for whom.
[ETH] and [EVAL] provide current constraints and evaluation status.
PoLB takes (domain, risk signals, ETH/EVAL status, ID/role, MEM context) and returns an active mode.
Jump engine selects:
- which engines are allowed (LLM, world-model, genius controller, SIL-compiled…),
- which RML levels and effect types are allowed,
- whether experiments can be attached,
- what extra logging / explanations are required.
Non-normative interface sketch:
from __future__ import annotations
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Mapping, Sequence
BP_MAX = 10_000
def clamp_bp(x: int) -> int:
return max(0, min(BP_MAX, int(x)))
class EnvelopeMode(str, Enum):
SANDBOX = "sandbox"
SHADOW = "shadow"
ONLINE = "online"
DEGRADED = "degraded"
class RiskBand(str, Enum):
OFFLINE = "OFFLINE"
SHADOW = "SHADOW"
ONLINE_SAFE = "ONLINE_SAFE"
ONLINE_EXPERIMENTAL = "ONLINE_EXPERIMENTAL"
EMERGENCY = "EMERGENCY"
class RiskLevel(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
EMERGENCY = "emergency"
@dataclass(frozen=True)
class PolbContext:
domain: str
principals: Sequence[str]
role_id: str
risk_signals: Mapping[str, Any]
eval_status: Mapping[str, Any]
eth_status: Mapping[str, Any]
incident_flags: Mapping[str, Any] = field(default_factory=dict)
@dataclass(frozen=True)
class PolbDecision:
envelope_mode: EnvelopeMode # sandbox | shadow | online | degraded
mode_name: str # e.g., "ONLINE_EXPERIMENTAL_STRATIFIED"
risk_band: RiskBand # OFFLINE | SHADOW | ONLINE_SAFE | ONLINE_EXPERIMENTAL | EMERGENCY
risk_score_bp: int # 0..10000 (scalar severity score; exported form avoids floats)
risk_level: RiskLevel # low | medium | high | emergency (tier label derived from risk_score_bp)
allowed_jump_types: Sequence[str]
allowed_rml_levels: Sequence[int]
experiments_allowed: bool
genius_policy: str
extra_requirements: Mapping[str, Any] = field(default_factory=dict)
def __post_init__(self) -> None:
object.__setattr__(self, "risk_score_bp", clamp_bp(self.risk_score_bp))
if self.risk_band == RiskBand.EMERGENCY and self.experiments_allowed:
raise ValueError("experiments_allowed must be false in EMERGENCY risk band")
class PolicyLoadBalancer:
def decide(self, ctx: PolbContext) -> PolbDecision:
raise NotImplementedError
The important point:
Nothing effectful runs “raw.” Everything goes through PoLB and RML gates.
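A sketch of such a gate at an effectful call-site; `decision` is assumed here to be a plain mapping mirroring the `PolbDecision` fields above:

```python
# Non-normative sketch: refuse an effectful request unless the active
# PoLB decision permits its jump type and RML level.

class PolbDenied(PermissionError):
    """Raised when the active PoLB mode does not permit a request."""

def gate_effectful_request(decision, jump_type: str, rml_level: int) -> None:
    mode = decision["mode_name"]
    if jump_type not in decision["allowed_jump_types"]:
        raise PolbDenied(f"jump_type {jump_type!r} not allowed in mode {mode}")
    if rml_level not in decision["allowed_rml_levels"]:
        raise PolbDenied(f"RML level {rml_level} not allowed in mode {mode}")
```

Call-sites wired this way fail closed: if PoLB cannot be consulted or the mode forbids the effect, the request never reaches RML execution.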
6. Degradation patterns: how to fail safer, not just off
PoLB is where graceful degradation is orchestrated.
Instead of a single “panic switch,” we want structured degradation patterns.
6.1 Pattern 1: Experiment-first shutdown
When incidents spike or observability degrades:
degradation_pattern: "experiment_first"
steps:
- "Disable ONLINE_EXPERIMENTAL_* modes"
- "Flush and freeze ongoing E-Jumps (experiments)"
- "Switch all evaluation traffic to control variants"
- "Keep ONLINE_SAFE_BASELINE mode if ETH/EVAL OK"
6.2 Pattern 2: Reduced-capability controllers
If certain engines look unstable (e.g. LLM regressions):
degradation_pattern: "controller_simplification"
steps:
- "Disable LLM-based decision engines for RML>=2"
- "Fallback to GeniusController + SIL-compiled controllers"
- "Narrow allowed jump_types to pre-validated templates"
- "Require human approval for high-risk Jumps"
6.3 Pattern 3: Observation-driven downgrade
From the earlier OBS / under-observation work:
degradation_pattern: "under_observation"
triggers:
# Exported form avoids floats: missing rate in [0,1] uses basis points (bp).
- "critical_obs_missing_rate_bp > 500 over 5min" # 0.05 == 5%
actions:
- "Switch from ONLINE_SAFE_BASELINE → SHADOW_PROD"
- "Freeze or cancel any in-flight effectful operations (RML>=1) pending human review"
- "Restrict new work to: pure + shadow/no-op effects only"
- "Route high-stakes decisions to human-in-loop queue"
These patterns should be pre-declared in PoLB policy, not improvised during incidents.
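A minimal sketch of stepping along the pre-declared `risk_degrade_order` from §4.1 (assumes the order is listed riskiest-first):

```python
# Non-normative sketch: move one step down a pre-declared degradation
# order. Returns the next-safer mode, or the current mode if already at
# the safest declared mode.

def next_safer_mode(risk_degrade_order: list[str], current_mode: str) -> str:
    try:
        i = risk_degrade_order.index(current_mode)
    except ValueError:
        # Unknown mode: fail toward the safest declared mode.
        return risk_degrade_order[-1]
    return risk_degrade_order[min(i + 1, len(risk_degrade_order) - 1)]
```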
7. Kill-switches: who can stop what, how fast, and with what trace
A kill-switch is a special PoLB-initiated mode transition, typically to:
- `SHADOW_PROD` (logic runs, no effects), or
- `OFFLINE_SANDBOX` (detached from live inputs), or
- a minimal `EMERGENCY_FAILSAFE` mode.
7.1 Kill-switch contract
Non-normative contract sketch:
kill_switch_contract:
id: "kill-switch-contract:cityos.flood_control/KS-CITYOS-2028-01"
created_at: "2028-04-01T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-01T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
# Pin meaning when exporting/hashing (avoid digest-only meaning).
bindings:
trust_anchor_set:
id: "si/trust-anchors/example/v1"
digest: "sha256:..."
domain: "cityos.flood_control"
allowed_initiators:
- "human:oncall_sre:*"
- "human:city_incident_commander:*"
scope:
modes_affected: ["ONLINE_*"]
effects_affected: ["flood_gate_control", "traffic_routing"]
target_mode: "SHADOW_PROD"
requirements:
justification_required: true
multi_party_auth:
min_approvers: 2
roles: ["oncall_sre", "city_incident_commander"]
logging:
level: "priority_audit"
mem_stream: "mem://polb/kill_switches"
post_event:
mandatory_postmortem: true
freeze_policy_changes_for_hours: 24
7.2 Non-normative implementation sketch
from __future__ import annotations
import fnmatch
from dataclasses import dataclass
from datetime import datetime, timezone
from hashlib import sha256
from typing import Any, Optional, Sequence
@dataclass(frozen=True)
class KillSwitchContract:
contract_id: str
domain: str
target_mode: str
allowed_initiators: Sequence[str]
min_approvers: int = 1
class KillSwitchService:
def __init__(self, polb: "PolicyLoadBalancer", mem: Any, id_resolver: Any):
self.polb = polb
self.mem = mem
self.id_resolver = id_resolver
def trigger(
self,
initiator_id: str,
contract_id: str,
reason: str,
approver_ids: Sequence[str] = (),
idempotency_key: Optional[str] = None,
) -> str:
contract = self._load_contract(contract_id)
# Stable unique list (initiator counts as an approver).
approvers = tuple(dict.fromkeys((initiator_id, *approver_ids)))
if len(approvers) < max(1, int(contract.min_approvers)):
raise PermissionError("Insufficient approvers for kill-switch")
for pid in approvers:
if not self._authorized(pid, contract):
raise PermissionError(f"Approver not allowed: {pid}")
as_of = datetime.now(timezone.utc).isoformat()
material = f"{contract_id}|{contract.target_mode}|{initiator_id}|{reason}|{as_of}|{idempotency_key or ''}"
event_id = sha256(material.encode("utf-8")).hexdigest()
event = {
"event_id": event_id,
"kind": "polb.kill_switch",
"contract_id": contract.contract_id,
"domain": contract.domain,
"initiator": initiator_id,
"approvers": list(approvers),
"reason": reason,
"as_of": {"time": as_of, "clock_profile": "si/clock-profile/utc/v1"},
"target_mode": contract.target_mode,
}
self.mem.append("mem://polb/kill_switches", event)
# Force PoLB mode (implementation-defined, but we pin the event_id for traceability).
self.polb.force_mode(contract.target_mode, reason=reason, source="kill_switch", event_id=event_id)
return event_id
def _load_contract(self, contract_id: str) -> KillSwitchContract:
raise NotImplementedError
def _authorized(self, principal_id: str, contract: KillSwitchContract) -> bool:
# Non-normative: shell-style wildcard matching for initiator patterns.
return any(fnmatch.fnmatch(principal_id, pat) for pat in contract.allowed_initiators)
The kill-switch is not “turn everything off”; it’s “force a predefined safe mode, with full trace.”
8. Interaction with ETH, EVAL, ID, MEM
PoLB is not a free-floating component. It is heavily entangled with the invariants.
8.1 ETH × PoLB
ETH policies set hard constraints:
- some modes may be disallowed entirely for specific populations,
- some domains may never enter ONLINE_EXPERIMENTAL.
ETH incidents can:
- automatically downgrade modes (e.g. ONLINE_EXPERIMENTAL → ONLINE_SAFE),
- block transitions to riskier modes until review.
8.2 EVAL × PoLB
EVAL defines what counts as “safe enough” to:
- enable an experimental mode,
- exit an emergency mode back to baseline.
For experiments:
- PoLB ensures experiments exist only in EVAL-governed modes,
- PoLB enforces population share limits and stop rules as part of the mode.
8.3 ID × PoLB
Role & Persona overlays define who is allowed to:
- change PoLB policy,
- trigger kill-switches,
- override mode transitions.
Delegation chains must be respected and checked:
- e.g. “teacher_delegate” cannot escalate PoLB mode for the whole school.
8.4 MEM × PoLB
Every PoLB decision is a first-class event in MEM:
- who changed mode, from what to what, why, under which signals.
Postmortems and forensics rely on this trace to answer:
- “Why were we in ONLINE_EXPERIMENTAL at that time?”
- “Why didn’t we downgrade sooner?”
9. Domain sketches
9.1 CityOS
Normal:
`ONLINE_SAFE_BASELINE`: SI-Core adjusts flows, predicts flood risk, runs limited experiments on digital signage.
Heavy rain forecast:
- EVAL + OBS bump risk band → PoLB transitions to `ONLINE_EXPERIMENTAL_STRATIFIED` only for simulation, while continuing safe operations.
Actual flood approaching threshold:
Risk + alarms → PoLB transitions to `EMERGENCY_FAILSAFE`:
- only pre-validated flood-gate Genius Traces allowed,
- RML limited to strongly compensable gate moves,
- human operator required on each Jump.
After event:
- PoLB goes to `SHADOW_PROD` while forensics run,
- later returns to `ONLINE_SAFE_BASELINE` once EVAL confirms stability.
9.2 Learning / education
Term start:
ONLINE_SAFE_BASELINE: personalized but conservative Jumps; no experimental curricula for vulnerable learners.
Mid-term A/B on practice scheduling:
specific student segments moved into `ONLINE_EXPERIMENTAL_STRATIFIED` mode:
- EVAL ensures power and fairness checks,
- PoLB enforces population limits and automatic stop rules.
Incident: stress-related ETH flag:
triggered by EVAL + OBS, PoLB:
- disables experimental modes for affected cohorts,
- routes decisions back to conservative controllers,
- notifies human teachers.
9.3 OSS / CI systems
Normal CI:
ONLINE_SAFE_BASELINE, deterministic test selection, low-risk RML only (labeling, logs).
Introducing new flaky-detector:
ONLINE_EXPERIMENTAL_STRATIFIED, stratified experiment on some repos only.
CI meltdown:
kill-switch → `SHADOW_PROD`:
- system still “thinks” about test selection and scheduling,
- but does not merge PRs or trigger deploys until human review.
10. Implementation path
You do not need a full SI-NOS implementation to start using PoLB-like ideas.
Pragmatic path:
Name your modes. Start with a very simple taxonomy: `OFFLINE`, `SHADOW`, `ONLINE_SAFE`, `ONLINE_EXPERIMENTAL`, `EMERGENCY`.
Define per-mode permissions. For each mode, explicitly write:
- which APIs can be called,
- which RML-equivalent operations are allowed,
- what level of experiment / A/B is permitted.
Add a small PoLB service
- Accepts current risk context, ETH/EVAL signals.
- Returns a mode descriptor used at call-sites.
Wire critical call-sites through PoLB
- Payment, actuation, irreversible updates, etc.
- Make them refuse to act when PoLB says “no.”
Define at least one kill-switch
- Hard-coded path to `SHADOW` or `OFFLINE` mode,
- Limited to specific human roles,
- Full audit.
Gradually refine policies
- Introduce degradation patterns,
- integrate with EVAL and ETH,
- create mode-specific dashboards.
The deeper SI-Core / SI-NOS stack simply formalizes this into:
- PoLB policies as first-class semantic objects,
- PoLB decisions as part of the structural trace,
- PoLB/EVAL/ETH/ID/MEM all interlocked.
But the core idea stays simple:
The Policy Load Balancer is how you tell an intelligent system how bold it is currently allowed to be — and how to climb back down.
11. Experiment design algorithms
Challenge: E-Jumps must propose experiments that are statistically sound and ETH/PoLB-compatible, not just “try variant B and see what happens.”
11.1 Sample size calculation
Non-normative sketch for a simple two-arm difference-in-means case:
from __future__ import annotations
import math
from statistics import NormalDist
from typing import Any, Mapping
_nd = NormalDist()
def _z(p: float) -> float:
return _nd.inv_cdf(p)
class SampleSizeCalculator:
def calculate(self, eval_surface: Any, effect_size: float, power: float = 0.8, alpha: float = 0.05) -> Mapping[str, Any]:
"""
Compute required sample size per variant for a primary metric using a
two-sample normal approximation.
Notes (non-normative):
- Intended for rough sizing, not as a validated stats library.
- effect_size is in metric units (difference in means).
- variance should be a per-arm variance estimate for the primary metric.
"""
if effect_size is None or float(effect_size) <= 0.0:
raise ValueError("effect_size must be > 0 for sample size calculation")
if not (0.0 < float(alpha) < 1.0) or not (0.0 < float(power) < 1.0):
raise ValueError("alpha and power must be in (0,1)")
primary_metric = eval_surface.objectives.primary[0]
variance = float(self._estimate_variance(primary_metric, eval_surface.population))
if variance <= 0.0:
raise ValueError("variance must be > 0")
z_alpha = _z(1.0 - float(alpha) / 2.0)
z_beta = _z(float(power))
n = 2.0 * variance * ((z_alpha + z_beta) / float(effect_size)) ** 2
n_per_variant = int(math.ceil(n))
variants = getattr(eval_surface, "variants", None) or [0, 1]
k = max(2, len(variants))
return {
"n_per_variant": n_per_variant,
"total_n": n_per_variant * k,
"assumptions": {
"effect_size": float(effect_size),
"variance": variance,
"power": float(power),
"alpha": float(alpha),
"test": "two-sample normal approximation (sketch)",
"arms_assumed": k,
},
}
11.2 Power analysis
Given an ongoing or completed experiment, an E-Jump should be able to report how powerful the design actually was.
from __future__ import annotations
import math
from statistics import NormalDist
from typing import Any, Mapping
_nd = NormalDist()
def _z(p: float) -> float:
return _nd.inv_cdf(p)
def calculate_n_for_power(target_power: float, effect_size: float, variance: float, alpha: float = 0.05) -> int:
"""Two-arm z-test sample size (per variant), rough approximation."""
if effect_size <= 0.0 or variance <= 0.0:
raise ValueError("effect_size and variance must be > 0")
z_alpha = _z(1.0 - float(alpha) / 2.0)
z_beta = _z(float(target_power))
n = 2.0 * float(variance) * ((z_alpha + z_beta) / float(effect_size)) ** 2
return int(math.ceil(n))
def estimate_variance(primary_metric: Any, population: Any) -> float:
"""Domain-specific variance estimator (non-normative placeholder)."""
raise NotImplementedError
def power_analysis(eval_surface: Any, observed_n: int, effect_size: float, alpha: float = 0.05) -> Mapping[str, Any]:
"""
Approximate post-hoc power for a two-arm difference-in-means z-test.
observed_n: per-variant sample size (n per arm).
effect_size: difference in means (metric units).
"""
n = max(2, int(observed_n))
primary_metric = eval_surface.objectives.primary[0]
variance = float(estimate_variance(primary_metric, eval_surface.population))
if variance <= 0.0:
raise ValueError("variance must be > 0")
# Standard error for difference of means (two independent arms)
se = math.sqrt(2.0 * variance / n)
z_effect = float(effect_size) / se
z_alpha = _z(1.0 - float(alpha) / 2.0)
# Two-sided power approximation under normality:
# P(|Z| > z_alpha) where Z ~ N(z_effect, 1)
power = (1.0 - _nd.cdf(z_alpha - z_effect)) + _nd.cdf(-z_alpha - z_effect)
power = max(0.0, min(1.0, power))
cohen_d = float(effect_size) / math.sqrt(variance)
return {
"power": power,
"cohen_d": cohen_d,
"required_for_80pct": calculate_n_for_power(
target_power=0.8,
effect_size=float(effect_size),
variance=variance,
alpha=float(alpha),
),
"assumptions": {"test": "two-sample z-test approximation (sketch)"},
}
(Real implementations will use more robust designs, but this level of reasoning is what E-Jumps should expose.)
11.3 Early stopping rules (sequential tests)
E-Jumps should design sequential monitoring plans with clear stopping rules, not ad-hoc peeking.
from __future__ import annotations
import math
from dataclasses import dataclass
from statistics import NormalDist
from typing import Any, Optional
_nd = NormalDist()
@dataclass(frozen=True)
class StopDecision:
stop: bool
reason: str = "continue"
spent_alpha: Optional[float] = None
effect_estimate: Any = None
confidence_interval: Any = None
class SequentialTestingEngine:
"""
Simple group-sequential monitoring with an O'Brien–Fleming-style *z-boundary*.
Notes (non-normative):
- This is still a sketch (not a validated sequential-testing library).
- The intent is: make the stopping rule explicit, deterministic, and traceable for PoLB/EVAL.
- We treat `_compute_test_stat` as returning a z-like statistic for a two-arm comparison.
"""
def __init__(self, alpha: float = 0.05, spending_function: str = "obrien_fleming"):
if not (0.0 < float(alpha) < 1.0):
raise ValueError("alpha must be in (0,1)")
self.alpha = float(alpha)
self.spending_function = spending_function
def check_stop(self, experiment: Any, current_data: Any, analysis_number: int, max_analyses: int) -> StopDecision:
a = max(1, int(analysis_number))
m = max(1, int(max_analyses))
test_stat, p_value = self._compute_test_stat(current_data, experiment.eval_surface)
# O'Brien–Fleming-style z boundary (two-sided).
boundary_z = self._obf_boundary_z(a, m)
nominal_alpha = 2.0 * (1.0 - _nd.cdf(boundary_z)) # already in [0, alpha]
if abs(float(test_stat)) >= boundary_z:
return StopDecision(
stop=True,
reason="efficacy",
spent_alpha=nominal_alpha,
effect_estimate=self._estimate_effect(current_data),
confidence_interval=self._ci(current_data, nominal_alpha),
)
if self._futility_check(current_data, experiment):
return StopDecision(stop=True, reason="futility", spent_alpha=nominal_alpha)
if self._harm_check(current_data, experiment.eth_constraints):
return StopDecision(stop=True, reason="eth_violation_detected", spent_alpha=nominal_alpha)
return StopDecision(stop=False, spent_alpha=nominal_alpha)
def _obf_boundary_z(self, analysis_number: int, max_analyses: int) -> float:
# Non-normative: classic OBF boundary approximation using information fraction t.
t = float(analysis_number) / float(max_analyses)
t = max(1e-9, min(1.0, t))
z_critical = _nd.inv_cdf(1.0 - self.alpha / 2.0)
return z_critical / math.sqrt(t)
11.4 E-Jump: experiment proposal algorithm
E-Jumps combine the above into a proposed design:
class ExperimentProposer:
"""
E-Jump component that proposes a concrete experiment design
given an EvalSurface, candidate policies, and an active PoLB decision.
"""
def propose_experiment(self, eval_surface, candidates, polb_decision):
"""Propose an E-Jump design consistent with PoLB / ETH / EVAL."""
# 0) PoLB gate: only propose experiments when explicitly allowed
if not getattr(polb_decision, "experiments_allowed", False):
return None # or return an offline-simulation draft, domain-specific
# 1) Minimum detectable effect from EvalSurface (or domain defaults)
effect_size = self._minimum_detectable_effect(eval_surface)
# 2) Sample size (see §11.1)
sample_size = self.sample_size_calc.calculate(eval_surface, effect_size)
# 3) Rollout strategy (PoLB-aware)
mode_name = getattr(polb_decision, "mode_name", "")
envelope_mode = getattr(polb_decision, "envelope_mode", "")
risk_band = getattr(polb_decision, "risk_band", "")
if risk_band in ("EMERGENCY",):
raise RuntimeError("Experiments must not be proposed in EMERGENCY risk band")
# Export-friendly share representation: basis points (bp) in [0, 10000]
if envelope_mode in ("shadow", "sandbox"):
rollout_mode = "shadow_only"
initial_share_bp = 0
elif str(mode_name).startswith("ONLINE_EXPERIMENTAL"):
rollout_mode = "canary"
initial_share_bp = 100 # 1%
else:
rollout_mode = "stratified"
initial_share_bp = 1000 # 10%
# Respect PoLB budget caps if present
extra = getattr(polb_decision, "extra_requirements", {}) or {}
budget = extra.get("polb_experiment_budget") or {}
cap_bp = int(budget.get("max_population_share_bp", 10_000))
initial_share_bp = max(0, min(cap_bp, int(initial_share_bp)))
# 4) Assignment scheme
assignment = self._design_assignment(
population=eval_surface.population,
candidates=candidates,
sample_size=sample_size,
polb_decision=polb_decision,
rollout_mode=rollout_mode,
initial_share_bp=initial_share_bp,
)
# 5) Monitoring plan
monitoring = MonitoringPlan(
metrics=list(eval_surface.objectives.primary) + list(getattr(eval_surface, "constraints", []) or []),
check_frequency="daily",
alert_thresholds=self._derive_alert_thresholds(eval_surface),
)
# 6) Stop rules (sequential testing)
stop_rules = self._design_stop_rules(eval_surface, sample_size)
return ExperimentJumpDraft(
assignment_scheme=assignment,
monitoring_plan=monitoring,
stop_rules=stop_rules,
expected_duration=self._estimate_duration(sample_size, eval_surface.population),
statistical_guarantees_bp={"power_bp": 8000, "alpha_bp": 500},
)
12. Multi-objective optimization in evaluation
Challenge: EvalSurface almost always has multiple objectives (e.g. learning gain, wellbeing, fairness, cost). E-Jumps must make trade-offs explicit and structured.
12.1 Pareto frontier over experiment designs
class ParetoExperimentOptimizer:
"""
Non-normative sketch for finding Pareto-optimal experiment designs
under a multi-objective EvalSurface.
"""
def find_pareto_optimal_experiments(self, eval_surface, candidate_experiments):
"""Return the Pareto frontier of experiment designs."""
evaluations = []
for exp in candidate_experiments:
scores = {}
# For each primary objective, estimate expected information gain
for obj in eval_surface.objectives.primary:
scores[obj.name] = self._predict_info_gain(exp, obj)
# Treat risk and cost as objectives to be MINIMIZED
scores["risk"] = self._assess_risk(exp, eval_surface)
scores["cost"] = self._estimate_cost(exp)
evaluations.append((exp, scores))
pareto_set = []
for i, (exp_i, scores_i) in enumerate(evaluations):
dominated = False
for j, (exp_j, scores_j) in enumerate(evaluations):
if i == j:
continue
if self._dominates(scores_j, scores_i, eval_surface):
dominated = True
break
if not dominated:
pareto_set.append((exp_i, scores_i))
return pareto_set
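The frontier computation above delegates to a `_dominates` helper that is not shown. A minimal standalone version, assuming risk and cost are minimized and all other objectives maximized (the objective-direction set is an illustrative assumption; in the class above, `eval_surface` would supply the directions instead):

```python
MINIMIZED = {"risk", "cost"}  # assumption: minimized objectives; others maximized

def dominates(scores_a: dict, scores_b: dict) -> bool:
    """Pareto dominance: A is at least as good as B on every objective
    and strictly better on at least one."""
    strictly_better = False
    for name, a in scores_a.items():
        b = scores_b[name]
        if name in MINIMIZED:
            a, b = -a, -b  # flip sign so "higher is better" holds uniformly
        if a < b:
            return False
        if a > b:
            strictly_better = True
    return strictly_better
```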
12.2 Scalarization methods
For many control flows, we eventually need a single scalar to compare variants.
def weighted_scalarization(eval_surface, outcomes_bp: dict) -> int:
"""
Convert multi-objective outcomes into a single scalar score (basis points, integer).
Convention (non-normative):
- objective weights are expressed as weight_bp (0..10000). If only weight (float) exists,
we derive weight_bp = round(weight * 10000).
- outcomes are expected to be normalized to basis points per objective (outcomes_bp[name]).
"""
score_bp = 0
for objective in eval_surface.objectives.primary:
        weight_bp = getattr(objective, "weight_bp", None)
        if weight_bp is None:
            # NB: do not pass this derivation as getattr's default — defaults are
            # evaluated eagerly and would raise if the objective lacks a float weight.
            weight_bp = int(round(float(objective.weight) * 10_000))
        weight_bp = int(weight_bp)
outcome_bp = int(outcomes_bp.get(objective.name, 0))
score_bp += (weight_bp * outcome_bp) // 10_000
# Hard constraints behave like vetoes with large penalties
for constraint in eval_surface.constraints.hard:
if not _check_constraint(constraint, outcomes_bp):
return -10**9 # effectively remove from consideration
return int(score_bp)
12.3 Multi-objective bandits
For continuous evaluation, we want bandit algorithms that respect EvalSurface structure.
class MultiObjectiveBandit:
"""Thompson sampling with scalarization over multiple objectives."""
def __init__(self, eval_surface, candidates):
self.eval_surface = eval_surface
self.candidates = candidates # list[Candidate]
self.posteriors = {c.id: self._init_posterior() for c in candidates}
def select_arm(self):
"""Select a candidate using Thompson sampling on scalarized objectives."""
samples = {}
for cand in self.candidates:
objective_samples = {}
for obj in self.eval_surface.objectives.primary:
objective_samples[obj.name] = self.posteriors[cand.id][obj.name].sample()
samples[cand.id] = self._scalarize(objective_samples, self.eval_surface)
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, outcomes: dict):
"""Update posteriors with observed metric outcomes."""
for obj in self.eval_surface.objectives.primary:
self.posteriors[cand_id][obj.name].update(outcomes[obj.name])
12.4 Constraint handling
constraint_handling_strategies:
hard_constraints:
strategy: "Feasibility preservation"
implementation: "Reject candidates that violate constraints"
soft_constraints:
strategy: "Penalty method"
implementation: "Add penalty term to scalarized objective"
chance_constraints:
strategy: "Probabilistic satisfaction"
implementation: "Require Pr(constraint satisfied) ≥ threshold"
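The chance-constraint row can be made concrete using the draft's integer conventions: express the required satisfaction probability in basis points and compare with pure integer arithmetic (a non-normative sketch; the function name is illustrative):

```python
def chance_constraint_ok(satisfied_count: int, total_count: int,
                         min_prob_bp: int) -> bool:
    """Empirical check of Pr(constraint satisfied) >= threshold,
    with the threshold in basis points (0..10000)."""
    if total_count == 0:
        return False  # no evidence: fail closed
    # Integer comparison avoids float drift: s/n >= t/10000  <=>  10000*s >= t*n
    return 10_000 * satisfied_count >= min_prob_bp * total_count
```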
13. Continuous evaluation and adaptive experiments
Challenge: Static A/B tests are often too slow or wasteful. We want continuous evaluation that is still ETH/PoLB-compatible.
13.1 Multi-armed bandit integration
class BanditEvaluator:
"""Continuous evaluation via multi-armed bandits."""
def __init__(self, eval_surface, candidates, algorithm: str = "thompson_sampling"):
self.eval_surface = eval_surface
self.candidates = candidates
if algorithm == "thompson_sampling":
self.bandit = ThompsonSamplingBandit(candidates)
elif algorithm == "ucb":
self.bandit = UCBBandit(candidates)
else:
raise ValueError(f"Unknown algorithm: {algorithm}")
def run_episode(self, principal, context):
"""
One evaluation 'episode':
- choose a candidate,
- execute the corresponding Jump,
- observe outcome,
- update bandit posterior.
"""
candidate = self.bandit.select_arm()
result = self.execute_jump(principal, context, candidate)
outcome = self.measure_outcome(result, self.eval_surface)
self.bandit.update(candidate.id, outcome)
return result
13.2 Contextual bandits for personalization
class ContextualBanditEvaluator:
"""Personalized evaluation via contextual bandits."""
def __init__(self, eval_surface, candidates, feature_extractor, posteriors):
self.eval_surface = eval_surface
self.candidates = candidates
self.feature_extractor = feature_extractor
self.posteriors = posteriors # per-candidate Bayesian linear models
def select_candidate(self, context):
"""Select a candidate based on context features."""
features = self.feature_extractor.extract(context)
samples = {}
for cand in self.candidates:
theta_sample = self.posteriors[cand.id].sample()
samples[cand.id] = np.dot(theta_sample, features)
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, context, outcome):
"""Bayesian linear regression update."""
features = self.feature_extractor.extract(context)
self.posteriors[cand_id].update(features, outcome)
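The per-candidate `posteriors` above are assumed to expose `sample()` and `update()`. One minimal realization is Bayesian linear regression with a Gaussian prior and known noise precision (the hyperparameters `alpha` and `beta` are illustrative defaults, not spec values):

```python
import numpy as np

class BayesianLinearModel:
    """Bayesian linear regression sketch: prior theta ~ N(0, alpha^-1 I),
    observation noise precision beta. Matches the sample()/update() calls
    used by ContextualBanditEvaluator."""
    def __init__(self, dim, alpha=1.0, beta=1.0, rng=None):
        self.precision = alpha * np.eye(dim)  # posterior precision matrix
        self.b = np.zeros(dim)                # accumulates beta * x * y
        self.beta = beta
        self.rng = rng or np.random.default_rng()

    def update(self, features, outcome):
        """Rank-1 posterior update for one (x, y) observation."""
        x = np.asarray(features, dtype=float)
        self.precision += self.beta * np.outer(x, x)
        self.b += self.beta * x * float(outcome)

    @property
    def mean(self):
        return np.linalg.solve(self.precision, self.b)

    def sample(self):
        """Draw theta ~ N(mean, precision^-1) for Thompson sampling."""
        return self.rng.multivariate_normal(self.mean, np.linalg.inv(self.precision))
```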
13.3 Adaptive experiment design
class AdaptiveExperimentDesigner:
"""Adjust experiment allocations based on accumulated evidence."""
def adapt_traffic_allocation(self, experiment, current_results):
"""
Reallocate traffic to better-performing variants,
subject to minimum allocation and ETH/PoLB constraints.
"""
posteriors = {}
for variant in experiment.variants:
posteriors[variant.id] = self._compute_posterior(variant, current_results)
# Probability each variant is best
prob_best = {}
for variant in experiment.variants:
prob_best[variant.id] = self._prob_best(posteriors, variant.id)
# Reallocate proportional to prob_best with a minimum share
new_allocations = {}
for variant in experiment.variants:
new_allocations[variant.id] = max(
0.05, # minimum 5% for exploration (tunable)
prob_best[variant.id],
)
        # Normalize to sum to 1.
        # NOTE: normalization can push a share back below the 5% floor;
        # re-apply the floor after normalizing if it is a hard requirement.
        total = sum(new_allocations.values())
        return {k: v / total for k, v in new_allocations.items()}
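The `_prob_best` step can be estimated by Monte Carlo over the variant posteriors. A sketch, assuming each posterior is summarized as a Gaussian `(mean, std)` pair (this representation is an assumption for illustration, not a contract):

```python
import numpy as np

def prob_best(posteriors, n_samples=10_000, seed=0):
    """Monte Carlo estimate of Pr(variant is best) for each variant,
    given per-variant Gaussian posteriors as (mean, std) tuples."""
    rng = np.random.default_rng(seed)
    ids = list(posteriors)
    # One column of posterior draws per variant.
    draws = np.column_stack(
        [rng.normal(posteriors[v][0], posteriors[v][1], size=n_samples) for v in ids]
    )
    wins = np.bincount(draws.argmax(axis=1), minlength=len(ids))
    return {v: wins[k] / n_samples for k, v in enumerate(ids)}
```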
13.4 Regret bounds (non-normative)
regret_analysis:
thompson_sampling:
bound: "O(√(K T log T))"
notes: "K = number of candidates, T = time horizon"
ucb:
bound: "O(√(K T log T))"
notes: "Often tighter constants than TS in practice"
contextual:
bound: "O(d √(T log T))"
notes: "d = feature dimension"
These are not contractual guarantees in SI-Core, but they inform why certain algorithms are acceptable under PoLB/EVAL for continuous evaluation.
14. Causal inference for evaluation
Challenge: Even with good randomization, we face confounding, selection bias, and heterogeneous treatment effects. E-Jumps should natively support causal views of evaluation, not just correlation.
14.1 A simple causal evaluation frame
class CausalEvaluator:
"""Causal inference helpers for experiments and observational data."""
def estimate_treatment_effect(self, data, treatment_var, outcome_var, covariates):
"""
Estimate the average treatment effect (ATE) with confounding adjustment
via inverse-propensity weighting (IPW).
Assumes no unmeasured confounding given covariates.
"""
        # Propensity score model (sklearn import shown inline for this sketch)
        from sklearn.linear_model import LogisticRegression
        ps_model = LogisticRegression()
        ps_model.fit(data[covariates], data[treatment_var])
        propensity = ps_model.predict_proba(data[covariates])[:, 1]
        # Clip propensities away from 0 and 1 so weights stay bounded.
        propensity = np.clip(propensity, 0.01, 0.99)
        treated = (data[treatment_var] == 1).to_numpy()
        weights = np.where(treated, 1 / propensity, 1 / (1 - propensity))
        y = data[outcome_var].to_numpy()
        n = len(data)
        # Horvitz-Thompson IPW: divide each group sum by the FULL sample size n.
        # Dividing by the group size (as a naive np.mean over the subset would)
        # biases the ATE; Hájek normalization is a lower-variance alternative.
        ate = (
            np.sum(weights[treated] * y[treated]) / n
            - np.sum(weights[~treated] * y[~treated]) / n
        )
return {
"ate": ate,
"std_error": self._bootstrap_se(
data, treatment_var, outcome_var, weights
),
}
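The `_bootstrap_se` helper is not shown; a generic nonparametric bootstrap over rows is one way to implement it (the signature here is simplified to a 1-D statistic for illustration; the IPW version would recompute the full weighted ATE, including refitting the propensity model, on each resample):

```python
import numpy as np

def bootstrap_se(statistic, values, n_boot=500, seed=0):
    """Nonparametric bootstrap standard error: resample rows with
    replacement, recompute the statistic, take the std across replicates."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    replicates = [statistic(values[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return float(np.std(replicates, ddof=1))
```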
14.2 Heterogeneous treatment effects (CATE)
class HTEEstimator:
"""Conditional average treatment effect (CATE) estimator."""
def estimate_cate(self, data, treatment, outcome, features):
"""
Simple T-learner with random forests.
Real implementations may use causal forests or other specialized models.
"""
from sklearn.ensemble import RandomForestRegressor
treated_data = data[data[treatment] == 1]
control_data = data[data[treatment] == 0]
model_treated = RandomForestRegressor().fit(
treated_data[features], treated_data[outcome]
)
model_control = RandomForestRegressor().fit(
control_data[features], control_data[outcome]
)
def cate(x):
return (
model_treated.predict([x])[0] -
model_control.predict([x])[0]
)
return cate
14.3 Instrumental variables
import pandas as pd
import statsmodels.api as sm

def instrumental_variable_estimation(data, instrument, treatment, outcome, controls):
    """
    Two-stage least squares (2SLS) for IV estimation.
    Assumes:
    - instrument affects outcome only via treatment (exclusion restriction),
    - instrument is correlated with treatment (relevance),
    - instrument is independent of the outcome error term conditional on controls.
    NOTE: the second-stage OLS standard errors below are NOT valid 2SLS standard
    errors (they ignore first-stage estimation error); production code should use
    a dedicated IV routine (e.g. linearmodels.IV2SLS).
    """
# First stage: treatment ~ instrument + controls
first_stage = sm.OLS(
data[treatment],
sm.add_constant(data[[instrument] + controls]),
).fit()
treatment_hat = first_stage.fittedvalues
# Second stage: outcome ~ treatment_hat + controls
second_stage = sm.OLS(
data[outcome],
sm.add_constant(pd.DataFrame({
"treatment_hat": treatment_hat,
**{c: data[c] for c in controls},
})),
).fit()
return {
"effect": second_stage.params["treatment_hat"],
"se": second_stage.bse["treatment_hat"],
"first_stage_f": first_stage.fvalue,
}
The point for SI-Core is not the specific estimators, but that EVAL has a causal mode, and that PoLB/ETH can require causal checks for certain domains.
15. Performance and scalability of evaluation systems
Challenge: Evaluation is itself a high-throughput, low-latency subsystem: assignment, logging, aggregation, monitoring.
15.1 Assignment service scaling
class ScalableAssignmentService:
"""Low-latency, high-throughput variant assignment."""
def __init__(self):
self.experiment_cache = ExperimentCache() # LRU or similar
self.assignment_cache = AssignmentCache() # deterministic mapping
self.async_logger = AsyncLogger()
def assign(self, principal_id, experiment_id, context):
"""
Sub-10ms p99 variant assignment, no synchronous DB writes
in the critical path.
"""
# Deterministic assignment cache
cached = self.assignment_cache.get(principal_id, experiment_id)
if cached:
return cached
experiment = self.experiment_cache.get(experiment_id)
variant = self._fast_assign(principal_id, experiment)
# Fire-and-forget logging
self.async_logger.log_assignment(principal_id, experiment_id, variant)
# Populate cache
self.assignment_cache.put(principal_id, experiment_id, variant)
return variant
def _fast_assign(self, principal_id, experiment):
"""
Deterministic assignment respecting traffic shares.
NOTE:
- Avoid language/runtime-dependent hash() (Python hash is salted per process).
- Use a stable digest (sha256) and basis-point buckets for portability.
"""
import hashlib
key = f"{principal_id}:{experiment.id}:{experiment.salt}".encode("utf-8")
digest = hashlib.sha256(key).digest()
# 0..9999 bucket (basis points)
bucket = int.from_bytes(digest[:4], "big") % 10000
cumulative_bp = 0
for variant in experiment.variants:
share_bp = getattr(variant, "traffic_share_bp", None)
if share_bp is None:
# Fallback if variant uses float share (non-normative): convert once, locally.
share = float(getattr(variant, "traffic_share", 0.0))
share_bp = int(round(share * 10000))
cumulative_bp += int(share_bp)
if bucket < cumulative_bp:
return variant
return experiment.control
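The stable-digest bucketing in `_fast_assign` can be exercised standalone to check its two key properties, determinism and approximate uniformity (the `bucket_bp` name is illustrative):

```python
import hashlib

def bucket_bp(principal_id, experiment_id, salt):
    """Stable 0..9999 bucket (basis points) from sha256, matching the
    assignment sketch: portable across processes and languages."""
    key = f"{principal_id}:{experiment_id}:{salt}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 10_000

# Determinism: identical inputs always yield the same bucket.
same = bucket_bp("user-1", "exp-1", "s1") == bucket_bp("user-1", "exp-1", "s1")

# Approximate uniformity: a 1000 bp (10%) cap captures roughly 10% of principals.
hits = sum(bucket_bp(f"user-{i}", "exp-1", "s1") < 1000 for i in range(10_000))
```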
15.2 Streaming metrics aggregation
class StreamingMetricsAggregator:
"""Approximate real-time metrics with bounded memory."""
def __init__(self):
self.sketches = {} # (exp_id, variant, metric) → sketch
def update(self, experiment_id, variant, metric_name, value):
"""Incrementally update the sketch for a metric."""
key = (experiment_id, variant, metric_name)
if key not in self.sketches:
self.sketches[key] = self._init_sketch(metric_name)
# e.g., t-digest for quantiles, count-min sketch for counts
self.sketches[key].update(value)
def query(self, experiment_id, variant, metric_name, quantile=0.5):
"""Query an approximate quantile."""
key = (experiment_id, variant, metric_name)
return self.sketches[key].quantile(quantile)
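`_init_sketch` is left abstract above; t-digest is the usual choice for quantiles. As a dependency-free stand-in, a reservoir sample also gives bounded-memory approximate quantiles (purely illustrative, with weaker accuracy guarantees than t-digest):

```python
import random

class ReservoirQuantileSketch:
    """Bounded-memory approximate quantiles via reservoir sampling
    (algorithm R): memory is capped at `capacity` regardless of stream size."""
    def __init__(self, capacity=1024, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.count = 0
        self.rng = random.Random(seed)

    def update(self, value):
        self.count += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(value)
        else:
            # Replace an existing entry with probability capacity/count.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.reservoir[j] = value

    def quantile(self, q):
        data = sorted(self.reservoir)
        idx = min(len(data) - 1, int(q * len(data)))
        return data[idx]
```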
15.3 Performance budgets (non-normative)
latency_budgets_p99:
assignment_service: 10ms
eth_check: 5ms
metrics_logging: 2ms # asynchronous
throughput_targets:
assignments_per_second: 100000
metrics_updates_per_second: 1000000
These budgets can be encoded into EvalSurface/PoLB for infra-like domains.
16. Experiment governance and approval processes
Challenge: High-risk experiments should not be “just code”. They need structured governance, tied into ID / ETH / PoLB.
16.1 Approval workflow
experiment_approval_workflow:
stage_1_proposal:
required:
- "EvalSurface definition"
- "PoLB configuration (mode, risk band)"
- "ETH constraints"
- "Sample size and power justification"
submitted_by: "experiment_designer"
stage_2_risk_assessment:
assessor: "domain_expert + eth_board"
criteria:
- "PoLB mode appropriate for risk level"
- "Hard ETH constraints cover key safety concerns"
- "Population share within limits"
- "Stop rules adequate and ETH-aware"
outputs:
- "risk_level: low|medium|high"
- "required_reviewers"
stage_3_review:
low_risk:
reviewers: ["technical_lead"]
turnaround: "2 business days"
medium_risk:
reviewers: ["technical_lead", "domain_expert"]
turnaround: "5 business days"
high_risk:
reviewers: ["technical_lead", "domain_expert", "eth_board", "legal"]
turnaround: "10 business days"
requires: "formal_ethics_committee_approval"
stage_4_monitoring:
automated:
- "stop_rule_checks"
- "eth_violation_detection"
human:
- "weekly_review_for_high_risk"
- "monthly_review_for_medium"
16.2 Risk rubric
class ExperimentRiskAssessor:
"""Systematic risk assessment for experiments."""
def assess_risk(self, experiment):
score = 0
envelope_mode = getattr(experiment.polb_config, "envelope_mode", None)
mode_name = getattr(experiment.polb_config, "mode_name", None)
# PoLB envelope: online is riskier than shadow/sandbox
if envelope_mode == "online":
score += 3
elif envelope_mode in ("shadow", "sandbox"):
score += 1
# Domain risk
domain = getattr(experiment.subject, "domain", None)
if domain in ["medical", "city_critical", "finance"]:
score += 3
# Population share (export-friendly): basis points
# Prefer explicit *_bp fields to avoid float ambiguity.
max_share_bp = getattr(experiment.polb_config, "max_population_share_bp", None)
if max_share_bp is None:
# Fallback: treat missing as "unknown" and be conservative.
score += 2
else:
            if max_share_bp > 2500:  # 2500 bp = 25% of population
score += 2
# Missing hard ETH constraints is a big red flag
if not getattr(experiment.eth_constraints, "hard", None):
score += 2
# Vulnerable populations
if self._involves_vulnerable_pop(experiment.population):
score += 3
if score <= 3:
return "low"
elif score <= 7:
return "medium"
else:
return "high"
16.3 Ethics committee integration
ethics_committee_review:
triggered_by:
- "risk_level == high"
- "involves_vulnerable_populations"
- "novel_experimental_design"
review_packet:
- "Experiment proposal (EvalSurface)"
- "Risk assessment report"
- "Informed consent procedures (if applicable)"
- "Data handling and retention plan"
- "Monitoring and stop rules"
- "Post-experiment analysis and rollback plan"
committee_decision:
- "Approved as proposed"
- "Approved with modifications"
- "Deferred pending additional information"
- "Rejected"
These pieces tie E-Jumps back into ID (who proposed/approved), ETH (what’s allowed), PoLB (which modes permit what), and MEM (full audit trail).