Spaces:

Flamehaven
/

stem-bio-ai

Running

App Files Files Community

stem-bio-ai / docs /CALIBRATION_PROFILE_DESIGN.md

Flamehaven Initiative

release: v1.8.0 mica runtime uplift

d647970 1 day ago

preview code

raw

history blame contribute delete

33.7 kB

	# STEM BIO-AI Calibration Profile Architecture

	Version: 1.8.0
	Status: implemented mirror-only calibration contract with derive/simulate preview surfaces; 1.8.0 preview hardening complete; authoritative read-through remains future work

	---

	## 1. Current State

	STEM BIO-AI already separates formal scoring, deterministic diagnostics, regulatory traceability, and AI advisory into distinct lanes.

	As of `1.8.0`, the repository ships a real calibration architecture:

	- packaged profiles in `policy/`
	- schema and runtime validation
	- result metadata surfacing for active profile identity
	- CLI policy visibility via `stem policy list`, `stem policy explain`, and `--policy <name>`
	- researcher-intent preview surfaces via `stem policy derive` and `stem policy simulate`

	What is still not fully separated is the authoritative score read-through surface:

	- stage weights
	- tier boundaries
	- clinical caps and hard floors
	- evidence-only versus score-authoritative detector status
	- reasoning-model status labels

	In `1.8.0`, most score-affecting values are still implemented as runtime constants plus prose in `SCORING_RATIONALE.md`, even though mirror-only profile metadata, CLI-visible profile selection, and derive/simulate preview surfaces are already live. That is acceptable for the current release line, but it still creates a long-term maintenance risk:

	> if calibration values are easy to change but hard to govern, the architecture will drift even if the lane boundaries remain conceptually correct.

	This document describes the implemented versioned calibration profile architecture and the remaining governed path to authoritative read-through.

	---

	## 2. Problem Statement

	As advisory systems become stronger, teams usually feel pressure to:

	- raise or lower tier thresholds
	- relax or tighten clinical caps
	- promote diagnostics from evidence-only into score-bearing logic
	- add new penalties or soften old ones
	- reinterpret reasoning or advisory outputs as scoring evidence

	If those changes happen ad hoc in code, three problems appear:

	1. the formal score becomes harder to reproduce across versions
	2. policy drift hides inside implementation edits
	3. advisory or diagnostic signals may slowly leak into the formal score without an explicit governance decision

	The issue is not whether calibration should ever change.

	The issue is whether calibration changes are:

	- versioned
	- reviewable
	- explainable in artifacts
	- bounded by explicit promotion rules

	---

	## 3. Design Goal

	The goal is not to let users arbitrarily tune the score from the CLI.

	The goal is to make calibration:

	- easy to inspect
	- easy to version
	- easy to compare between releases
	- hard to mutate accidentally

	In short:

	> STEM BIO-AI needs easy maintenance, not easy drift.

	---

	## 4. Core Principle

	Calibration should become a policy object, not a scattered implementation detail.

	That means:

	- the active scoring profile is represented in a versioned file
	- result artifacts record which profile was used
	- policy changes become visible release events
	- score-affecting changes require explicit promotion criteria

	This preserves the current architectural discipline:

	- formal score remains deterministic
	- diagnostics can stay evidence-only until promoted
	- advisory remains structurally subordinate to the score
	- regulatory mapping remains traceability support, not a score multiplier

	---

	## 5. Implemented Shape

	Current packaged profile files:

	`policy/scoring_profile.default.v1.json`
	`policy/scoring_profile.strict_clinical_adjacency.v1.json`

	Deferred profiles:

	- `reproducibility_first`
	- `documentation_lenient`
	- `research_repo_baseline`
	- `biosecurity_cautious`

	Important restriction:

	- normal users should select from named profiles
	- normal users should not pass arbitrary weights or tier cutoffs on the command line

	Good:

	```bash
	stem scan <repo> --policy default
	stem scan <repo> --policy benchmark-candidate
	```

	Bad:

	```bash
	stem scan <repo> --stage1-weight 0.35 --t3-threshold 68 --cap 72
	```

	The first preserves governance.
	The second turns the tool into an untracked tuning console.

	---

	## 6. Current Profile Contract

	Current shipped fields:

	```json
	{
	"policy_schema_version": "1",
	"policy_version": "ca-policy-1.0",
	"tool_version_introduced": "1.6.5",
	"tool_version_last_validated": "1.8.0",
	"profile_name": "default",
	"profile_status": "authoritative_release",
	"profile_read_mode": "mirror_only",
	"weights": {
	"stage_1_percent": 40,
	"stage_2r_percent": 20,
	"stage_3_percent": 40
	},
	"stage_baselines": {
	"stage_1": 60,
	"stage_2r": 60,
	"stage_3": 0
	},
	"tier_policy": {
	"tier_names": ["T0", "T1", "T2", "T3", "T4"],
	"tier_boundaries": [40, 55, 70, 85],
	"boundary_semantics": "left_closed_right_open",
	"score_domain": "integer_0_to_100"
	},
	"clinical_policy": {
	"ca_no_disclaimer_cap": 69,
	"t0_hard_floor_cap": 39
	},
	"code_integrity_policy": {
	"C1_penalty": 10,
	"C2_score_affecting": false,
	"C3_score_affecting": false,
	"C4_score_affecting": false
	},
	"stage_3_policy": {
	"normalization": {
	"kind": "linear_round",
	"raw_max": 80,
	"target_max": 100,
	"rounding": "half_up_int"
	}
	},
	"diagnostic_policy": {
	"BIO_smiles_surface_integrity": "evidence_only",
	"BIO_smiles_rdkit_validation": "evidence_only",
	"BIO_smiles_parser_guard": "evidence_only",
	"BIO_silent_mock_fallback": "evidence_only",
	"BIO_traceability_manifest_surface": "evidence_only",
	"BIO_subprocess_run_trace": "evidence_only"
	},
	"reasoning_policy": {
	"status": "diagnostic_only_uncalibrated_initial_prior",
	"score_integration": "forbidden"
	},
	"governance_sources": {
	"ca_taxonomy_version": "ca-taxonomy-v1",
	"ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
	}
	}
	```

	This is the active shipped schema family in `1.8.0`.

	Schema notes:

	- weights should be stored as integer percentages, not floating-point fractions
	- tier boundaries should be stored once as a single ordered array
	- normalization should be represented as named semantics plus parameters, not a free-form expression string
	- `policy_version` should be independent from the tool release version
	- `profile_read_mode` must distinguish mirror-only exposure from authoritative runtime loading
	- `stage_3_policy.b2_partial_credit_mode` is currently a declared mirror-only profile field; authoritative Stage 3 B2 scoring in `1.8.0` still follows the hardcoded scanner path and does not yet read this value directly
	- `governance_sources.ca_taxonomy_version` must increment whenever runtime CA trigger membership, severity mapping, or cap-relevant phrase semantics change

	Current `profile_status` state set:

	- `preview_only`
	- `experimental`
	- `benchmark_candidate`
	- `authoritative_release`
	- `deprecated`

	Current status transition path:

	`preview_only -> experimental -> benchmark_candidate -> authoritative_release -> deprecated`

	Other transitions should require an explicit migration note.

	---

	## 7. Artifact Requirements

	Every result object should record:

	- `policy_schema_version`
	- `policy_version`
	- `profile_name`
	- `profile_status`
	- `profile_read_mode`
	- `policy_sha256`

	Why:

	- two runs are not meaningfully comparable unless they share the same active profile
	- policy drift should be visible in the artifact itself
	- benchmark comparisons should be able to say whether differences came from repository evidence or policy revision
	- mirror-only and authoritative-read runs must not look equivalent in artifacts

	`policy_sha256` must be defined precisely.

	Recommended definition:

	- canonicalize the policy JSON using sorted keys and UTF-8 encoding
	- exclude the `policy_sha256` field itself from the hash input
	- hash the canonicalized policy file bytes only

	In `mirror_only` mode, the profile file may leave `policy_sha256` as `null`.

	The runtime artifact should still surface the computed canonical hash so profile comparisons remain stable during Phase 1.

	This hash does not claim to represent every runtime governance source.

	Instead, runtime governance dependencies such as the CA taxonomy should be surfaced separately under `governance_sources`.

	Recommended JSON example:

	```json
	"calibration_profile": {
	"policy_schema_version": "1",
	"policy_version": "ca-policy-1.0",
	"profile_name": "default",
	"profile_status": "authoritative_release",
	"profile_read_mode": "mirror_only",
	"policy_sha256": "..."
	},
	"governance_sources": {
	"ca_taxonomy_version": "ca-taxonomy-v1",
	"ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
	}
	```

	---

	## 8. Diagnostics Graduation Policy

	The hardest maintenance problem is not weight tuning.

	It is detector promotion:

	> when does an evidence-only detector become score-authoritative?

	Recommended detector states:

	- `evidence_only`
	- `candidate_scored`
	- `scored`
	- `deprecated`

	Recommended transition rules:

	\| From \| To \| Allowed? \| Notes \|
	\|---\|---\|---\|---\|
	\| `evidence_only` \| `candidate_scored` \| yes \| requires promotion gate below \|
	\| `candidate_scored` \| `scored` \| yes \| requires promotion gate below \|
	\| `candidate_scored` \| `evidence_only` \| yes \| allowed if benchmark review regresses confidence \|
	\| `scored` \| `candidate_scored` \| yes \| allowed for rollback after release observation \|
	\| `scored` \| `deprecated` \| yes \| allowed when detector is retired \|
	\| `evidence_only` \| `deprecated` \| yes \| allowed when detector is abandoned \|
	\| `deprecated` \| any active state \| no by default \| require explicit redesign note \|

	Recommended promotion gate before moving from `evidence_only` to `candidate_scored`:

	1. commit-pinned benchmark fixtures exist for at least `N >= 20` repositories
	2. detector output is reproducible across `3` consecutive identical runs
	3. false-positive review has been documented with observed `false_positive_rate <= 0.05`
	4. a release note explains what changed
	5. `SCORING_RATIONALE.md` is updated if the detector affects score logic

	Recommended promotion gate before moving from `candidate_scored` to `scored`:

	1. benchmark evidence shows the detector improves review precision on the maintained fixture set
	2. at least one release cycle of observation has occurred
	3. the profile change is versioned as a policy revision

	This is the governance mechanism that prevents “AI got more capable, so we quietly started scoring with it.”

	---

	## 9. Advisory Boundary Rule

	The calibration profile should explicitly state that advisory output cannot rewrite the formal score unless a future architecture intentionally changes that rule.

	Recommended field:

	```json
	"reasoning_policy": {
	"status": "diagnostic_only_uncalibrated_initial_prior",
	"score_integration": "forbidden"
	}
	```

	That matters because boundary failures often begin as convenience:

	- a provider looks helpful
	- the advisory output seems more nuanced
	- a team wants to “just incorporate it a little”

	Once that happens without a versioned policy change, the formal score stops being what the architecture claims it is.

	---

	## 10. CLI Policy Surface

	Recommended CLI behavior:

	- `--policy default`
	- `--policy <named_profile>`
	- `--list-policies`

	Not recommended:

	- direct numeric overrides for weights, thresholds, caps, or detector promotion state

	Developer-only experimental override support is acceptable if all of the following are true:

	- it is clearly marked non-authoritative
	- it writes a different `profile_status`
	- output artifacts visibly say experimental policy was used
	- it is excluded from default examples and documentation

	---

	## 11. Researcher UX and Participation Model

	The most important UX constraint is this:

	> researchers should be able to influence policy intent without turning the CLI into a free-form scoring console.

	That means STEM BIO-AI should prefer:

	- named profile templates
	- guided questions
	- side-by-side score diffs
	- explicit promotion to shared policy

	over:

	- raw numeric knobs
	- hidden threshold editing
	- untracked one-off scoring profiles

	### 11.1 Starting Point: Profile Templates

	The first interaction should not be:

	> "enter your own weights and caps"

	It should be:

	> "which evaluation posture best matches your repository context?"

	Current active named profiles:

	- `default`
	- `strict_clinical_adjacency`

	Deferred named profiles:

	- `reproducibility_first`
	- `research_repo_baseline`
	- `documentation_lenient`
	- `biosecurity_cautious`

	These names are easier for researchers to reason about than raw numbers.

	### 11.2 Guided Policy Builder

	After a template is selected, the next layer should be a guided builder rather than free-form editing.

	Examples of acceptable questions:

	- "Should code-integrity evidence outweigh README surface evidence?"
	- "Should clinical-adjacent claims trigger stricter caps?"
	- "Should bias/limitations require structured sections rather than term presence?"
	- "Should replication evidence matter more for your workflow?"

	The user answers policy questions.

	The system translates them into profile deltas.

	This preserves usability while keeping the policy surface inspectable.

	### 11.2.1 Researcher Intent Scale

	Before users touch any named policy, STEM BIO-AI should reduce the interpretation gap between:

	- what the researcher actually cares about
	- what the default profile currently emphasizes

	The implemented mechanism is a researcher intent layer built around short `1–5` scales.

	Important boundary:

	> the `1–5` scale is a UX input surface, not part of the formal score engine.

	In other words:

	- users do not set formal weights directly
	- users do not set tier thresholds directly
	- users do not generate a score by summing their answers

	Instead, the scale helps the system infer which existing policy posture is closer to the researcher's intent.

	Recommended scale interpretation:

	- `1` = minimal emphasis
	- `2` = light emphasis
	- `3` = moderate emphasis
	- `4` = strong emphasis
	- `5` = very strong emphasis

	Current question areas:

	- how strict clinical-adjacent claims should be treated
	- whether code-integrity evidence should outweigh README/documentation evidence
	- how much reproducibility evidence should matter
	- whether structured limitations should be required before partial credit is awarded

	This approach borrows the usability advantage of Likert-style scales without turning the scanner into a free-form tuning instrument.

	### 11.2.2 Why the Scale Belongs in UX, Not Scoring

	Researchers can usually answer:

	> "clinical-adjacent claims should be treated very strictly"

	more reliably than:

	> "set the CA cap delta to -12 and reduce the Stage 1 weight by 0.05"

	That is why the scale should live in the interview layer.

	The formal engine should still consume:

	- named profiles
	- explicit policy objects
	- versioned calibration state

	The scale is only a translation surface between human intent and governed policy.

	### 11.2.3 Translation Rule

	The system should map researcher answers to one of three outcomes:

	1. recommend an existing named profile
	2. show a preview-only profile delta
	3. show that the default profile already matches the stated posture under an explicit rule

	This is safer than letting the user edit raw scoring parameters directly.

	The current implementation uses an auditable rule table instead of a hidden similarity function.

	Current intent variables:

	- `clinical_strictness`
	- `code_integrity_priority`
	- `reproducibility_priority`
	- `structured_limitations_requirement`

	Current `1.8.0` decision rules:

	\| Condition \| Outcome \|
	\|---\|---\|
	\| `clinical_strictness >= 4` and `reproducibility_priority <= 3` \| recommend `strict_clinical_adjacency` \|
	\| all four values are `2` or `3` \| keep `default` \|
	\| no named profile rule matches \| generate `preview_only` profile delta from explicit bounded deltas only \|

	This narrow table is intentional. It keeps the translation layer visible, reviewable, and testable without pretending that every strong posture already has a release-grade named profile. In particular, `reproducibility_first` remains deferred in `1.8.0`; high reproducibility answers still fall back to `preview_only` Stage 4 emphasis rather than a named recommendation.

	Rule priority:

	- evaluate rules top-down
	- stop at the first named-profile match
	- if multiple strong postures are simultaneously requested and no single named profile dominates, fall back to `preview_only`

	Example:

	- `clinical_strictness = 4`
	- `reproducibility_priority = 4`

	This should fall back to `preview_only` in the initial implementation rather than pretending one named posture has clear priority.

	Lower-bound meaning:

	- `1` means minimal emphasis
	- `1` does not remove or disable an axis
	- therefore the minimum scale value still participates in threshold checks such as `<= 2`

	Current "default already matches" rule:

	- if the selected baseline is `default`
	- and all four intent variables are in the `2..3` range
	- and no explicit named-profile rule is triggered

	then the system should report that the default profile already matches the stated posture closely enough to avoid a custom preview.

	Current `preview_only` boundary:

	- do not compute nearest-profile distance
	- do not infer hidden similarity scores
	- do not mutate arbitrary raw numbers

	Instead:

	- start from the selected baseline profile
	- apply explicit bounded deltas associated with the triggered answers
	- mark the result as `preview_only`

	This keeps the intent layer auditable during the first implementation cycle.

	Current bounded deltas used in preview-only mode:

	\| Triggered answer \| Allowed `preview_only` delta shape \|
	\|---\|---\|
	\| `clinical_strictness >= 4` with no named-profile match \| switch to stricter CA posture only; do not change unrelated weights \|
	\| `reproducibility_priority >= 4` with no named-profile match \| raise Stage 4 emphasis only within predeclared policy bounds \|
	\| `structured_limitations_requirement >= 4` with no named-profile match \| require stricter Stage 3 B2 partial-credit posture only \|
	\| multiple strong answers with no named-profile match \| combine only explicitly listed bounded deltas; do not infer new arithmetic outside documented policy fields \|

	These are active preview-only deltas in `1.8.0`. They are not hidden similarity operations and they do not mutate the authoritative scan path.

	### 11.2.4 Comparison Output

	This subsection describes the immediate output of the intent-scale flow.

	If the scale is used, the system should immediately show:

	- chosen baseline profile
	- recommended profile or preview delta
	- score difference on the current repository
	- tier difference on the current repository
	- which policy dimensions changed

	The key UX question is not:

	> "what settings changed?"

	It is:

	> "what did the repository outcome change to, and why?"

	### 11.2.5 Current Named Profile Definitions

	The current implementation defines two named profiles:

	- `default`
	- `strict_clinical_adjacency`

	Deferred until explicitly defined:

	- `documentation_lenient`
	- not active in the `1.8.0` rule table
	- `research_repo_baseline`
	- not active in the `1.8.0` rule table
	- `biosecurity_cautious`
	- not active in the `1.8.0` rule table
	- `reproducibility_first`
	- intentionally deferred until an actual policy diff exists and a release-grade recommendation path is defined

	Documented diff fields per active profile:

	- stage weights
	- clinical cap / hard-floor posture
	- Stage 3 B2 strictness posture
	- Stage 4 emphasis posture

	Current starter diff:

	\| Profile \| Stage weights \| Clinical posture \| B2 posture \| Stage 4 posture \|
	\|---\|---\|---\|---\|---\|
	\| `default` \| `0.40 / 0.20 / 0.40` \| standard CA cap / hard-floor rules \| structured boundary language for partial credit \| current baseline \|
	\| `strict_clinical_adjacency` \| `0.40 / 0.20 / 0.40` \| tighter `ca_no_disclaimer_cap=60`, tighter `t0_hard_floor_cap=35` \| same as `default` \| `baseline` \|

	Any additional named profile must document its concrete diff here before it becomes eligible for CLI recommendation.

	### 11.2.6 Next UX Step: `simulate --profile-file`

	The next reasonable researcher-facing extension is a simulation-only profile file input.

	Recommended shape:

	```bash
	stem policy simulate <repo> --profile-file my_profile.json
	```

	This should be allowed because it improves domain experimentation without weakening the authoritative scan contract.

	Intended behavior:

	- load a local profile file through the same schema and runtime validation path
	- treat the file as simulation-only input
	- do not register the file as an installed named profile automatically
	- do not let `scan --policy` or `gate --policy` consume it on the authoritative path

	Required guardrails:

	- schema-valid before simulation starts
	- `profile_read_mode` must remain `mirror_only` or `preview_only` for external profile-file simulation
	- artifact output must clearly mark the file as local simulation input, not a packaged release profile
	- profile hash should still be surfaced so simulation outputs remain comparable

	Recommended artifact labels for this future path:

	- `profile_name = external_profile_file`
	- `profile_status = preview_only`
	- `profile_source = local_file`
	- `policy_sha256 = <computed canonical hash>`

	The goal is to let researchers try domain-specific posture proposals without creating a backdoor around governed named-profile promotion.

	### 11.3 Side-by-Side Simulation

	This subsection describes the more general simulation surface, which may also be used outside the intent-scale interview.

	The most important feedback loop is not the profile editor itself.

	It is the comparison view.

	Researchers should be able to see:

	- default profile result
	- custom profile result
	- score delta
	- tier delta
	- cap / hard-floor delta
	- which evidence lanes changed the outcome

	The right question is not:

	> "what numbers did I change?"

	It is:

	> "what review outcome changed, and why?"

	### 11.4 Roles

	The intended governance model is not "every researcher defines the official score alone."

	Recommended roles:

	- `researcher`
	- explains domain priorities
	- surfaces false positives / false negatives
	- evaluates whether the default policy fits the repository context
	- `policy steward`
	- maintains release-grade named profiles
	- reviews score-affecting changes
	- prevents silent drift between personal and team policy
	- `tool`
	- computes policy diff
	- computes result diff
	- records profile metadata in artifacts

	This division keeps domain experts involved without sacrificing reproducibility.

	### 11.4.1 Researcher Participation Rules

	The participation model should stay simple:

	- researchers may propose posture changes
	- researchers may run `derive` and `simulate`
	- researchers may edit personal or branch-local preview profiles for comparison work
	- researchers should not directly redefine the release-grade default policy on the authoritative score path

	The key distinction is:

	- `preview_only` and `experimental` are valid spaces for domain input
	- `authoritative_release` is a governed release artifact

	This means a researcher can legitimately say:

	> "for this domain, I want stricter clinical-adjacent treatment and stronger reproducibility emphasis"

	but should not unilaterally convert that statement into:

	- new official tier boundaries
	- new default caps
	- new score-bearing detector promotion
	- new release-grade policy semantics

	The system should therefore optimize for:

	- easy posture expression
	- easy repository-specific simulation
	- hard-to-mutate official policy

	### 11.4.2 Operating Principles

	Operationally, the collaboration rule is:

	1. the researcher expresses domain priorities
	2. the tool translates those priorities into a visible named-profile recommendation or `preview_only` delta
	3. the policy steward decides whether that posture remains local, becomes experimental, or is promoted into a release-grade policy artifact

	In practice:

	- a researcher should be able to explore calibration without editing scanner code
	- a steward should be able to reject score-affecting drift even when the local preview is reasonable
	- artifacts should clearly distinguish personal preview from official release policy

	The intended output is not "personalized truth."

	The intended output is:

	- a stable official score policy
	- a transparent preview lane for domain-specific posture testing
	- an explicit governance path between the two

	### 11.4.3 Responsibility Matrix

	\| Action \| Researcher \| Policy steward \| Tool \|
	\|---\|---\|---\|---\|
	\| express domain posture \| primary \| optional review \| guided input surface \|
	\| run `stem policy derive` \| primary \| optional review \| translates intent \|
	\| run `stem policy simulate` \| primary \| optional review \| computes baseline vs preview \|
	\| edit `preview_only` deltas for local exploration \| allowed \| review optional \| validates bounded deltas \|
	\| create or modify `experimental` named profiles \| propose \| approve / reject \| validates profile schema and metadata \|
	\| promote a profile to `benchmark_candidate` \| propose evidence \| required approval \| records status transition \|
	\| promote a profile to `authoritative_release` \| provide domain rationale \| required approval \| requires parity / benchmark metadata \|
	\| change default release policy semantics \| no unilateral authority \| required owner \| records artifact provenance \|
	\| change score-bearing detector status \| no unilateral authority \| required owner \| enforces transition metadata \|

	### 11.5 Promotion Path

	Recommended progression:

	1. personal preview profile
	2. side-by-side run against real repository output
	3. review and comparison against default profile
	4. promotion to named team policy if approved

	That promotion should update:

	- `profile_name`
	- `profile_status`
	- `policy_sha256`
	- changelog / rationale references when score logic changes

	### 11.5.1 Promotion Gates

	The progression above should not be symbolic only.

	Each transition should have an explicit gate:

	\| From \| To \| Minimum gate \|
	\|---\|---\|---\|
	\| `preview_only` \| `experimental` \| profile file exists, schema-valid, bounded diff documented, repository-side simulation reviewed \|
	\| `experimental` \| `benchmark_candidate` \| compared against default on named fixtures or benchmark repos, intended score deltas explained, no hidden arithmetic \|
	\| `benchmark_candidate` \| `authoritative_release` \| parity or benchmark note completed, rationale updated, changelog entry prepared, steward approval recorded \|
	\| `authoritative_release` \| `deprecated` \| replacement or retirement note recorded, artifact comparability preserved \|

	The intent is to prevent one common failure mode:

	> a locally useful domain tweak quietly becoming the new official score policy without an explicit release decision

	### 11.5.2 What Researchers Can Change Directly

	Researchers should be allowed to directly change:

	- intent-scale answers
	- selected baseline profile for simulation
	- branch-local `preview_only` deltas inside documented bounds
	- explanatory notes attached to why a preview better matches the domain

	Researchers should not directly change, on the authoritative path:

	- default release profile semantics
	- tier boundaries for the official score
	- detector graduation state
	- score-bearing penalty activation rules
	- release-grade policy status labels

	If a domain team wants one of those changes, the correct path is:

	1. simulate locally
	2. capture the proposed diff
	3. compare against default on real repositories
	4. propose promotion through the governed profile path

	### 11.6 Interface Direction

	Recommended near-term CLI additions:

	- `stem policy list`
	- `stem policy explain <name>`
	- `stem scan <repo> --policy <name>`

	Recommended later UX additions:

	- wizard-style policy derivation
	- side-by-side simulation view
	- profile diff explanation panel
	- `simulate --profile-file <path>` for governed local experimentation without named-profile promotion

	The guiding rule is:

	> researchers should tune posture through explicit choices, not hidden arithmetic.

	---

	## 12. Implemented Milestones and Remaining Roadmap

	### 1.6.5: mirror-only profile contract

	- create the profile file
	- keep scanner behavior unchanged
	- expose profile metadata in JSON output
	- verify the profile matches current release behavior exactly using a differential fixture set with gold outputs

	Recommended fixture format:

	```json
	{
	"fixture_name": "default_profile_parity_small_repo",
	"target_repo": "tests/fixtures/repos/small_repo_a",
	"expected": {
	"raw_score_before_floor": 67,
	"final_score": 67,
	"formal_tier": "T2 Caution",
	"score_cap": null
	}
	}
	```

	Recommended fixture location:

	- `tests/fixtures/calibration_profiles/`

	Phase 1 parity target fields:

	- `raw_score_before_floor`
	- `final_score`
	- `formal_tier`
	- `classification.score_cap`

	### 1.6.6: policy visibility

	- added `stem policy list` and `stem policy explain`
	- added `--policy <name>` on scan/gate/advisory workflows
	- surfaced selected profile metadata in stdout, Markdown, explain text, and PDF headers

	### 1.6.7: derive/simulate preview

	- added `stem policy derive` for auditable intent translation, now standardized on the governed `1–5` posture scale
	- added `stem policy simulate <repo>` for baseline-vs-preview outcome comparison
	- kept derive/simulate outputs mirror-only so authoritative scoring remains unchanged

	### 1.6.8: preview hardening and citation readiness

	- kept mirror-only scan scoring unchanged while hardening preview simulation against future profile drift
	- aligned `simulate` with profile-aware C1 penalty behavior instead of assuming scanner constants forever
	- revalidated `preview_only` profiles after bounded deltas are applied
	- strengthened mirror-only wording across CLI and report surfaces so `scan --policy` is not confused with `policy simulate`
	- added `CITATION.cff` and `.zenodo.json` so release artifacts are ready for DOI-backed citation once GitHub releases are archived by Zenodo

	### Remaining roadmap

	- authoritative read-through of policy weights/caps/thresholds
	- additional release-grade named profiles beyond `strict_clinical_adjacency`
	- explicit read-through for currently declared mirror-only fields such as `stage_3_policy.b2_partial_credit_mode`
	- `ca-taxonomy-vN` governance policy so runtime trigger-set changes are versioned as first-class release events
	- Phase 2 target release remains intentionally unset until parity fixtures, differential tests, and rollback notes are ready for the first score-authoritative read-through patch
	- future score-affecting policy changes require:
	- profile update
	- rationale update
	- changelog entry
	- benchmark note when relevant

	This keeps calibration governance ahead of personalization while avoiding a risky “big bang” rewrite.

	---

	## 13. Non-Goals

	This architecture does not aim to:

	- let users personalize trust scores
	- introduce hidden model-based calibration
	- turn regulatory mapping into a numerical score multiplier
	- make advisory output part of the formal score
	- replace benchmark or manual review with profile editing

	It only aims to make calibration easier to maintain without weakening lane boundaries.

	---

	## 15. Draft: `reproducibility_first`

	`reproducibility_first` is still deferred as an active named recommendation, but a draft posture is reasonable now.

	Draft intent:

	- researchers want reproducibility evidence to matter more in review posture
	- they do not necessarily want stricter clinical caps
	- they are usually asking for stronger replication scrutiny, not a different claim-risk philosophy

	Draft profile shape:

	\| Profile \| Stage weights \| Clinical posture \| B2 posture \| Stage 4 posture \|
	\|---\|---\|---\|---\|---\|
	\| `reproducibility_first` (draft) \| `0.40 / 0.20 / 0.40` \| same as `default` \| same as `default` \| `stronger_than_baseline` \|

	Recommended initial metadata:

	- `profile_status = experimental`
	- `profile_read_mode = mirror_only`
	- `policy_version = ca-policy-1.0-repro-first`

	Important limitation:

	In the current engine, Stage 4 remains a separate replication lane and does not alter `score.final_score`.

	That means `reproducibility_first` is currently useful for:

	- simulation posture comparison
	- replication-lane emphasis
	- future promotion groundwork

	but not yet for:

	- score-authoritative final-score change on the formal scan path

	So the draft should remain deferred until one of the following is true:

	1. there is a release-grade rationale for how stronger replication posture should affect official review outcomes
	2. the simulation/report surface exposes a meaningful replication-posture difference without pretending it changed the formal score
	3. a future Phase 2 read-through explicitly defines what parts of Stage 4 posture become authoritative and under what governance rule

	Until then, high reproducibility intent should continue to:

	- fall back to `preview_only`
	- raise Stage 4 emphasis in simulation
	- avoid pretending that an official release-grade named profile already exists

	---

	## 14. Final Position

	STEM BIO-AI now has a real calibration architecture.

	The correct mechanism is:

	> a versioned calibration profile with explicit promotion rules

	not:

	> ad hoc runtime tuning knobs

	If the system is serious about preserving the distinction between:

	- formal score
	- diagnostics
	- regulatory traceability
	- advisory

	then calibration must be governed with the same discipline.

	The right outcome is not “more adjustable.”

	The right outcome is:

	more maintainable, without becoming easier to drift.