stem-bio-ai / docs /CALIBRATION_PROFILE_DESIGN.md
Flamehaven Initiative
release: v1.8.0 mica runtime uplift
d647970

A newer version of the Gradio SDK is available: 6.15.2

Upgrade

STEM BIO-AI Calibration Profile Architecture

Version: 1.8.0 Status: implemented mirror-only calibration contract with derive/simulate preview surfaces; 1.8.0 preview hardening complete; authoritative read-through remains future work


1. Current State

STEM BIO-AI already separates formal scoring, deterministic diagnostics, regulatory traceability, and AI advisory into distinct lanes.

As of 1.8.0, the repository ships a real calibration architecture:

  • packaged profiles in policy/
  • schema and runtime validation
  • result metadata surfacing for active profile identity
  • CLI policy visibility via stem policy list, stem policy explain, and --policy <name>
  • researcher-intent preview surfaces via stem policy derive and stem policy simulate

What is still not fully separated is the authoritative score read-through surface:

  • stage weights
  • tier boundaries
  • clinical caps and hard floors
  • evidence-only versus score-authoritative detector status
  • reasoning-model status labels

In 1.8.0, most score-affecting values are still implemented as runtime constants plus prose in SCORING_RATIONALE.md, even though mirror-only profile metadata, CLI-visible profile selection, and derive/simulate preview surfaces are already live. That is acceptable for the current release line, but it still creates a long-term maintenance risk:

if calibration values are easy to change but hard to govern, the architecture will drift even if the lane boundaries remain conceptually correct.

This document describes the implemented versioned calibration profile architecture and the remaining governed path to authoritative read-through.


2. Problem Statement

As advisory systems become stronger, teams usually feel pressure to:

  • raise or lower tier thresholds
  • relax or tighten clinical caps
  • promote diagnostics from evidence-only into score-bearing logic
  • add new penalties or soften old ones
  • reinterpret reasoning or advisory outputs as scoring evidence

If those changes happen ad hoc in code, three problems appear:

  1. the formal score becomes harder to reproduce across versions
  2. policy drift hides inside implementation edits
  3. advisory or diagnostic signals may slowly leak into the formal score without an explicit governance decision

The issue is not whether calibration should ever change.

The issue is whether calibration changes are:

  • versioned
  • reviewable
  • explainable in artifacts
  • bounded by explicit promotion rules

3. Design Goal

The goal is not to let users arbitrarily tune the score from the CLI.

The goal is to make calibration:

  • easy to inspect
  • easy to version
  • easy to compare between releases
  • hard to mutate accidentally

In short:

STEM BIO-AI needs easy maintenance, not easy drift.


4. Core Principle

Calibration should become a policy object, not a scattered implementation detail.

That means:

  • the active scoring profile is represented in a versioned file
  • result artifacts record which profile was used
  • policy changes become visible release events
  • score-affecting changes require explicit promotion criteria

This preserves the current architectural discipline:

  • formal score remains deterministic
  • diagnostics can stay evidence-only until promoted
  • advisory remains structurally subordinate to the score
  • regulatory mapping remains traceability support, not a score multiplier

5. Implemented Shape

Current packaged profile files:

policy/scoring_profile.default.v1.json policy/scoring_profile.strict_clinical_adjacency.v1.json

Deferred profiles:

  • reproducibility_first
  • documentation_lenient
  • research_repo_baseline
  • biosecurity_cautious

Important restriction:

  • normal users should select from named profiles
  • normal users should not pass arbitrary weights or tier cutoffs on the command line

Good:

stem scan <repo> --policy default
stem scan <repo> --policy benchmark-candidate

Bad:

stem scan <repo> --stage1-weight 0.35 --t3-threshold 68 --cap 72

The first preserves governance.
The second turns the tool into an untracked tuning console.


6. Current Profile Contract

Current shipped fields:

{
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "tool_version_introduced": "1.6.5",
  "tool_version_last_validated": "1.8.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "weights": {
    "stage_1_percent": 40,
    "stage_2r_percent": 20,
    "stage_3_percent": 40
  },
  "stage_baselines": {
    "stage_1": 60,
    "stage_2r": 60,
    "stage_3": 0
  },
  "tier_policy": {
    "tier_names": ["T0", "T1", "T2", "T3", "T4"],
    "tier_boundaries": [40, 55, 70, 85],
    "boundary_semantics": "left_closed_right_open",
    "score_domain": "integer_0_to_100"
  },
  "clinical_policy": {
    "ca_no_disclaimer_cap": 69,
    "t0_hard_floor_cap": 39
  },
  "code_integrity_policy": {
    "C1_penalty": 10,
    "C2_score_affecting": false,
    "C3_score_affecting": false,
    "C4_score_affecting": false
  },
  "stage_3_policy": {
    "normalization": {
      "kind": "linear_round",
      "raw_max": 80,
      "target_max": 100,
      "rounding": "half_up_int"
    }
  },
  "diagnostic_policy": {
    "BIO_smiles_surface_integrity": "evidence_only",
    "BIO_smiles_rdkit_validation": "evidence_only",
    "BIO_smiles_parser_guard": "evidence_only",
    "BIO_silent_mock_fallback": "evidence_only",
    "BIO_traceability_manifest_surface": "evidence_only",
    "BIO_subprocess_run_trace": "evidence_only"
  },
  "reasoning_policy": {
    "status": "diagnostic_only_uncalibrated_initial_prior",
    "score_integration": "forbidden"
  },
  "governance_sources": {
    "ca_taxonomy_version": "ca-taxonomy-v1",
    "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
  }
}

This is the active shipped schema family in 1.8.0.

Schema notes:

  • weights should be stored as integer percentages, not floating-point fractions
  • tier boundaries should be stored once as a single ordered array
  • normalization should be represented as named semantics plus parameters, not a free-form expression string
  • policy_version should be independent from the tool release version
  • profile_read_mode must distinguish mirror-only exposure from authoritative runtime loading
  • stage_3_policy.b2_partial_credit_mode is currently a declared mirror-only profile field; authoritative Stage 3 B2 scoring in 1.8.0 still follows the hardcoded scanner path and does not yet read this value directly
  • governance_sources.ca_taxonomy_version must increment whenever runtime CA trigger membership, severity mapping, or cap-relevant phrase semantics change

Current profile_status state set:

  • preview_only
  • experimental
  • benchmark_candidate
  • authoritative_release
  • deprecated

Current status transition path:

preview_only -> experimental -> benchmark_candidate -> authoritative_release -> deprecated

Other transitions should require an explicit migration note.


7. Artifact Requirements

Every result object should record:

  • policy_schema_version
  • policy_version
  • profile_name
  • profile_status
  • profile_read_mode
  • policy_sha256

Why:

  • two runs are not meaningfully comparable unless they share the same active profile
  • policy drift should be visible in the artifact itself
  • benchmark comparisons should be able to say whether differences came from repository evidence or policy revision
  • mirror-only and authoritative-read runs must not look equivalent in artifacts

policy_sha256 must be defined precisely.

Recommended definition:

  • canonicalize the policy JSON using sorted keys and UTF-8 encoding
  • exclude the policy_sha256 field itself from the hash input
  • hash the canonicalized policy file bytes only

In mirror_only mode, the profile file may leave policy_sha256 as null.

The runtime artifact should still surface the computed canonical hash so profile comparisons remain stable during Phase 1.

This hash does not claim to represent every runtime governance source.

Instead, runtime governance dependencies such as the CA taxonomy should be surfaced separately under governance_sources.

Recommended JSON example:

"calibration_profile": {
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "policy_sha256": "..."
},
"governance_sources": {
  "ca_taxonomy_version": "ca-taxonomy-v1",
  "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
}

8. Diagnostics Graduation Policy

The hardest maintenance problem is not weight tuning.

It is detector promotion:

when does an evidence-only detector become score-authoritative?

Recommended detector states:

  • evidence_only
  • candidate_scored
  • scored
  • deprecated

Recommended transition rules:

From To Allowed? Notes
evidence_only candidate_scored yes requires promotion gate below
candidate_scored scored yes requires promotion gate below
candidate_scored evidence_only yes allowed if benchmark review regresses confidence
scored candidate_scored yes allowed for rollback after release observation
scored deprecated yes allowed when detector is retired
evidence_only deprecated yes allowed when detector is abandoned
deprecated any active state no by default require explicit redesign note

Recommended promotion gate before moving from evidence_only to candidate_scored:

  1. commit-pinned benchmark fixtures exist for at least N >= 20 repositories
  2. detector output is reproducible across 3 consecutive identical runs
  3. false-positive review has been documented with observed false_positive_rate <= 0.05
  4. a release note explains what changed
  5. SCORING_RATIONALE.md is updated if the detector affects score logic

Recommended promotion gate before moving from candidate_scored to scored:

  1. benchmark evidence shows the detector improves review precision on the maintained fixture set
  2. at least one release cycle of observation has occurred
  3. the profile change is versioned as a policy revision

This is the governance mechanism that prevents “AI got more capable, so we quietly started scoring with it.”


9. Advisory Boundary Rule

The calibration profile should explicitly state that advisory output cannot rewrite the formal score unless a future architecture intentionally changes that rule.

Recommended field:

"reasoning_policy": {
  "status": "diagnostic_only_uncalibrated_initial_prior",
  "score_integration": "forbidden"
}

That matters because boundary failures often begin as convenience:

  • a provider looks helpful
  • the advisory output seems more nuanced
  • a team wants to “just incorporate it a little”

Once that happens without a versioned policy change, the formal score stops being what the architecture claims it is.


10. CLI Policy Surface

Recommended CLI behavior:

  • --policy default
  • --policy <named_profile>
  • --list-policies

Not recommended:

  • direct numeric overrides for weights, thresholds, caps, or detector promotion state

Developer-only experimental override support is acceptable if all of the following are true:

  • it is clearly marked non-authoritative
  • it writes a different profile_status
  • output artifacts visibly say experimental policy was used
  • it is excluded from default examples and documentation

11. Researcher UX and Participation Model

The most important UX constraint is this:

researchers should be able to influence policy intent without turning the CLI into a free-form scoring console.

That means STEM BIO-AI should prefer:

  • named profile templates
  • guided questions
  • side-by-side score diffs
  • explicit promotion to shared policy

over:

  • raw numeric knobs
  • hidden threshold editing
  • untracked one-off scoring profiles

11.1 Starting Point: Profile Templates

The first interaction should not be:

"enter your own weights and caps"

It should be:

"which evaluation posture best matches your repository context?"

Current active named profiles:

  • default
  • strict_clinical_adjacency

Deferred named profiles:

  • reproducibility_first
  • research_repo_baseline
  • documentation_lenient
  • biosecurity_cautious

These names are easier for researchers to reason about than raw numbers.

11.2 Guided Policy Builder

After a template is selected, the next layer should be a guided builder rather than free-form editing.

Examples of acceptable questions:

  • "Should code-integrity evidence outweigh README surface evidence?"
  • "Should clinical-adjacent claims trigger stricter caps?"
  • "Should bias/limitations require structured sections rather than term presence?"
  • "Should replication evidence matter more for your workflow?"

The user answers policy questions.

The system translates them into profile deltas.

This preserves usability while keeping the policy surface inspectable.

11.2.1 Researcher Intent Scale

Before users touch any named policy, STEM BIO-AI should reduce the interpretation gap between:

  • what the researcher actually cares about
  • what the default profile currently emphasizes

The implemented mechanism is a researcher intent layer built around short 1–5 scales.

Important boundary:

the 1–5 scale is a UX input surface, not part of the formal score engine.

In other words:

  • users do not set formal weights directly
  • users do not set tier thresholds directly
  • users do not generate a score by summing their answers

Instead, the scale helps the system infer which existing policy posture is closer to the researcher's intent.

Recommended scale interpretation:

  • 1 = minimal emphasis
  • 2 = light emphasis
  • 3 = moderate emphasis
  • 4 = strong emphasis
  • 5 = very strong emphasis

Current question areas:

  • how strict clinical-adjacent claims should be treated
  • whether code-integrity evidence should outweigh README/documentation evidence
  • how much reproducibility evidence should matter
  • whether structured limitations should be required before partial credit is awarded

This approach borrows the usability advantage of Likert-style scales without turning the scanner into a free-form tuning instrument.

11.2.2 Why the Scale Belongs in UX, Not Scoring

Researchers can usually answer:

"clinical-adjacent claims should be treated very strictly"

more reliably than:

"set the CA cap delta to -12 and reduce the Stage 1 weight by 0.05"

That is why the scale should live in the interview layer.

The formal engine should still consume:

  • named profiles
  • explicit policy objects
  • versioned calibration state

The scale is only a translation surface between human intent and governed policy.

11.2.3 Translation Rule

The system should map researcher answers to one of three outcomes:

  1. recommend an existing named profile
  2. show a preview-only profile delta
  3. show that the default profile already matches the stated posture under an explicit rule

This is safer than letting the user edit raw scoring parameters directly.

The current implementation uses an auditable rule table instead of a hidden similarity function.

Current intent variables:

  • clinical_strictness
  • code_integrity_priority
  • reproducibility_priority
  • structured_limitations_requirement

Current 1.8.0 decision rules:

Condition Outcome
clinical_strictness >= 4 and reproducibility_priority <= 3 recommend strict_clinical_adjacency
all four values are 2 or 3 keep default
no named profile rule matches generate preview_only profile delta from explicit bounded deltas only

This narrow table is intentional. It keeps the translation layer visible, reviewable, and testable without pretending that every strong posture already has a release-grade named profile. In particular, reproducibility_first remains deferred in 1.8.0; high reproducibility answers still fall back to preview_only Stage 4 emphasis rather than a named recommendation.

Rule priority:

  • evaluate rules top-down
  • stop at the first named-profile match
  • if multiple strong postures are simultaneously requested and no single named profile dominates, fall back to preview_only

Example:

  • clinical_strictness = 4
  • reproducibility_priority = 4

This should fall back to preview_only in the initial implementation rather than pretending one named posture has clear priority.

Lower-bound meaning:

  • 1 means minimal emphasis
  • 1 does not remove or disable an axis
  • therefore the minimum scale value still participates in threshold checks such as <= 2

Current "default already matches" rule:

  • if the selected baseline is default
  • and all four intent variables are in the 2..3 range
  • and no explicit named-profile rule is triggered

then the system should report that the default profile already matches the stated posture closely enough to avoid a custom preview.

Current preview_only boundary:

  • do not compute nearest-profile distance
  • do not infer hidden similarity scores
  • do not mutate arbitrary raw numbers

Instead:

  • start from the selected baseline profile
  • apply explicit bounded deltas associated with the triggered answers
  • mark the result as preview_only

This keeps the intent layer auditable during the first implementation cycle.

Current bounded deltas used in preview-only mode:

Triggered answer Allowed preview_only delta shape
clinical_strictness >= 4 with no named-profile match switch to stricter CA posture only; do not change unrelated weights
reproducibility_priority >= 4 with no named-profile match raise Stage 4 emphasis only within predeclared policy bounds
structured_limitations_requirement >= 4 with no named-profile match require stricter Stage 3 B2 partial-credit posture only
multiple strong answers with no named-profile match combine only explicitly listed bounded deltas; do not infer new arithmetic outside documented policy fields

These are active preview-only deltas in 1.8.0. They are not hidden similarity operations and they do not mutate the authoritative scan path.

11.2.4 Comparison Output

This subsection describes the immediate output of the intent-scale flow.

If the scale is used, the system should immediately show:

  • chosen baseline profile
  • recommended profile or preview delta
  • score difference on the current repository
  • tier difference on the current repository
  • which policy dimensions changed

The key UX question is not:

"what settings changed?"

It is:

"what did the repository outcome change to, and why?"

11.2.5 Current Named Profile Definitions

The current implementation defines two named profiles:

  • default
  • strict_clinical_adjacency

Deferred until explicitly defined:

  • documentation_lenient
    • not active in the 1.8.0 rule table
  • research_repo_baseline
    • not active in the 1.8.0 rule table
  • biosecurity_cautious
    • not active in the 1.8.0 rule table
  • reproducibility_first
    • intentionally deferred until an actual policy diff exists and a release-grade recommendation path is defined

Documented diff fields per active profile:

  • stage weights
  • clinical cap / hard-floor posture
  • Stage 3 B2 strictness posture
  • Stage 4 emphasis posture

Current starter diff:

Profile Stage weights Clinical posture B2 posture Stage 4 posture
default 0.40 / 0.20 / 0.40 standard CA cap / hard-floor rules structured boundary language for partial credit current baseline
strict_clinical_adjacency 0.40 / 0.20 / 0.40 tighter ca_no_disclaimer_cap=60, tighter t0_hard_floor_cap=35 same as default baseline

Any additional named profile must document its concrete diff here before it becomes eligible for CLI recommendation.

11.2.6 Next UX Step: simulate --profile-file

The next reasonable researcher-facing extension is a simulation-only profile file input.

Recommended shape:

stem policy simulate <repo> --profile-file my_profile.json

This should be allowed because it improves domain experimentation without weakening the authoritative scan contract.

Intended behavior:

  • load a local profile file through the same schema and runtime validation path
  • treat the file as simulation-only input
  • do not register the file as an installed named profile automatically
  • do not let scan --policy or gate --policy consume it on the authoritative path

Required guardrails:

  • schema-valid before simulation starts
  • profile_read_mode must remain mirror_only or preview_only for external profile-file simulation
  • artifact output must clearly mark the file as local simulation input, not a packaged release profile
  • profile hash should still be surfaced so simulation outputs remain comparable

Recommended artifact labels for this future path:

  • profile_name = external_profile_file
  • profile_status = preview_only
  • profile_source = local_file
  • policy_sha256 = <computed canonical hash>

The goal is to let researchers try domain-specific posture proposals without creating a backdoor around governed named-profile promotion.

11.3 Side-by-Side Simulation

This subsection describes the more general simulation surface, which may also be used outside the intent-scale interview.

The most important feedback loop is not the profile editor itself.

It is the comparison view.

Researchers should be able to see:

  • default profile result
  • custom profile result
  • score delta
  • tier delta
  • cap / hard-floor delta
  • which evidence lanes changed the outcome

The right question is not:

"what numbers did I change?"

It is:

"what review outcome changed, and why?"

11.4 Roles

The intended governance model is not "every researcher defines the official score alone."

Recommended roles:

  • researcher
    • explains domain priorities
    • surfaces false positives / false negatives
    • evaluates whether the default policy fits the repository context
  • policy steward
    • maintains release-grade named profiles
    • reviews score-affecting changes
    • prevents silent drift between personal and team policy
  • tool
    • computes policy diff
    • computes result diff
    • records profile metadata in artifacts

This division keeps domain experts involved without sacrificing reproducibility.

11.4.1 Researcher Participation Rules

The participation model should stay simple:

  • researchers may propose posture changes
  • researchers may run derive and simulate
  • researchers may edit personal or branch-local preview profiles for comparison work
  • researchers should not directly redefine the release-grade default policy on the authoritative score path

The key distinction is:

  • preview_only and experimental are valid spaces for domain input
  • authoritative_release is a governed release artifact

This means a researcher can legitimately say:

"for this domain, I want stricter clinical-adjacent treatment and stronger reproducibility emphasis"

but should not unilaterally convert that statement into:

  • new official tier boundaries
  • new default caps
  • new score-bearing detector promotion
  • new release-grade policy semantics

The system should therefore optimize for:

  • easy posture expression
  • easy repository-specific simulation
  • hard-to-mutate official policy

11.4.2 Operating Principles

Operationally, the collaboration rule is:

  1. the researcher expresses domain priorities
  2. the tool translates those priorities into a visible named-profile recommendation or preview_only delta
  3. the policy steward decides whether that posture remains local, becomes experimental, or is promoted into a release-grade policy artifact

In practice:

  • a researcher should be able to explore calibration without editing scanner code
  • a steward should be able to reject score-affecting drift even when the local preview is reasonable
  • artifacts should clearly distinguish personal preview from official release policy

The intended output is not "personalized truth."

The intended output is:

  • a stable official score policy
  • a transparent preview lane for domain-specific posture testing
  • an explicit governance path between the two

11.4.3 Responsibility Matrix

Action Researcher Policy steward Tool
express domain posture primary optional review guided input surface
run stem policy derive primary optional review translates intent
run stem policy simulate primary optional review computes baseline vs preview
edit preview_only deltas for local exploration allowed review optional validates bounded deltas
create or modify experimental named profiles propose approve / reject validates profile schema and metadata
promote a profile to benchmark_candidate propose evidence required approval records status transition
promote a profile to authoritative_release provide domain rationale required approval requires parity / benchmark metadata
change default release policy semantics no unilateral authority required owner records artifact provenance
change score-bearing detector status no unilateral authority required owner enforces transition metadata

11.5 Promotion Path

Recommended progression:

  1. personal preview profile
  2. side-by-side run against real repository output
  3. review and comparison against default profile
  4. promotion to named team policy if approved

That promotion should update:

  • profile_name
  • profile_status
  • policy_sha256
  • changelog / rationale references when score logic changes

11.5.1 Promotion Gates

The progression above should not be symbolic only.

Each transition should have an explicit gate:

From To Minimum gate
preview_only experimental profile file exists, schema-valid, bounded diff documented, repository-side simulation reviewed
experimental benchmark_candidate compared against default on named fixtures or benchmark repos, intended score deltas explained, no hidden arithmetic
benchmark_candidate authoritative_release parity or benchmark note completed, rationale updated, changelog entry prepared, steward approval recorded
authoritative_release deprecated replacement or retirement note recorded, artifact comparability preserved

The intent is to prevent one common failure mode:

a locally useful domain tweak quietly becoming the new official score policy without an explicit release decision

11.5.2 What Researchers Can Change Directly

Researchers should be allowed to directly change:

  • intent-scale answers
  • selected baseline profile for simulation
  • branch-local preview_only deltas inside documented bounds
  • explanatory notes attached to why a preview better matches the domain

Researchers should not directly change, on the authoritative path:

  • default release profile semantics
  • tier boundaries for the official score
  • detector graduation state
  • score-bearing penalty activation rules
  • release-grade policy status labels

If a domain team wants one of those changes, the correct path is:

  1. simulate locally
  2. capture the proposed diff
  3. compare against default on real repositories
  4. propose promotion through the governed profile path

11.6 Interface Direction

Recommended near-term CLI additions:

  • stem policy list
  • stem policy explain <name>
  • stem scan <repo> --policy <name>

Recommended later UX additions:

  • wizard-style policy derivation
  • side-by-side simulation view
  • profile diff explanation panel
  • simulate --profile-file <path> for governed local experimentation without named-profile promotion

The guiding rule is:

researchers should tune posture through explicit choices, not hidden arithmetic.


12. Implemented Milestones and Remaining Roadmap

1.6.5: mirror-only profile contract

  • create the profile file
  • keep scanner behavior unchanged
  • expose profile metadata in JSON output
  • verify the profile matches current release behavior exactly using a differential fixture set with gold outputs

Recommended fixture format:

{
  "fixture_name": "default_profile_parity_small_repo",
  "target_repo": "tests/fixtures/repos/small_repo_a",
  "expected": {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": null
  }
}

Recommended fixture location:

  • tests/fixtures/calibration_profiles/

Phase 1 parity target fields:

  • raw_score_before_floor
  • final_score
  • formal_tier
  • classification.score_cap

1.6.6: policy visibility

  • added stem policy list and stem policy explain
  • added --policy <name> on scan/gate/advisory workflows
  • surfaced selected profile metadata in stdout, Markdown, explain text, and PDF headers

1.6.7: derive/simulate preview

  • added stem policy derive for auditable intent translation, now standardized on the governed 1–5 posture scale
  • added stem policy simulate <repo> for baseline-vs-preview outcome comparison
  • kept derive/simulate outputs mirror-only so authoritative scoring remains unchanged

1.6.8: preview hardening and citation readiness

  • kept mirror-only scan scoring unchanged while hardening preview simulation against future profile drift
  • aligned simulate with profile-aware C1 penalty behavior instead of assuming scanner constants forever
  • revalidated preview_only profiles after bounded deltas are applied
  • strengthened mirror-only wording across CLI and report surfaces so scan --policy is not confused with policy simulate
  • added CITATION.cff and .zenodo.json so release artifacts are ready for DOI-backed citation once GitHub releases are archived by Zenodo

Remaining roadmap

  • authoritative read-through of policy weights/caps/thresholds
  • additional release-grade named profiles beyond strict_clinical_adjacency
  • explicit read-through for currently declared mirror-only fields such as stage_3_policy.b2_partial_credit_mode
  • ca-taxonomy-vN governance policy so runtime trigger-set changes are versioned as first-class release events
  • Phase 2 target release remains intentionally unset until parity fixtures, differential tests, and rollback notes are ready for the first score-authoritative read-through patch
  • future score-affecting policy changes require:
    • profile update
    • rationale update
    • changelog entry
    • benchmark note when relevant

This keeps calibration governance ahead of personalization while avoiding a risky “big bang” rewrite.


13. Non-Goals

This architecture does not aim to:

  • let users personalize trust scores
  • introduce hidden model-based calibration
  • turn regulatory mapping into a numerical score multiplier
  • make advisory output part of the formal score
  • replace benchmark or manual review with profile editing

It only aims to make calibration easier to maintain without weakening lane boundaries.


15. Draft: reproducibility_first

reproducibility_first is still deferred as an active named recommendation, but a draft posture is reasonable now.

Draft intent:

  • researchers want reproducibility evidence to matter more in review posture
  • they do not necessarily want stricter clinical caps
  • they are usually asking for stronger replication scrutiny, not a different claim-risk philosophy

Draft profile shape:

Profile Stage weights Clinical posture B2 posture Stage 4 posture
reproducibility_first (draft) 0.40 / 0.20 / 0.40 same as default same as default stronger_than_baseline

Recommended initial metadata:

  • profile_status = experimental
  • profile_read_mode = mirror_only
  • policy_version = ca-policy-1.0-repro-first

Important limitation:

In the current engine, Stage 4 remains a separate replication lane and does not alter score.final_score.

That means reproducibility_first is currently useful for:

  • simulation posture comparison
  • replication-lane emphasis
  • future promotion groundwork

but not yet for:

  • score-authoritative final-score change on the formal scan path

So the draft should remain deferred until one of the following is true:

  1. there is a release-grade rationale for how stronger replication posture should affect official review outcomes
  2. the simulation/report surface exposes a meaningful replication-posture difference without pretending it changed the formal score
  3. a future Phase 2 read-through explicitly defines what parts of Stage 4 posture become authoritative and under what governance rule

Until then, high reproducibility intent should continue to:

  • fall back to preview_only
  • raise Stage 4 emphasis in simulation
  • avoid pretending that an official release-grade named profile already exists

14. Final Position

STEM BIO-AI now has a real calibration architecture.

The correct mechanism is:

a versioned calibration profile with explicit promotion rules

not:

ad hoc runtime tuning knobs

If the system is serious about preserving the distinction between:

  • formal score
  • diagnostics
  • regulatory traceability
  • advisory

then calibration must be governed with the same discipline.

The right outcome is not “more adjustable.”

The right outcome is:

more maintainable, without becoming easier to drift.