Spaces:

Flamehaven
/

stem-bio-ai

Running

App Files Files Community

stem-bio-ai / docs /CALIBRATION_PROFILE_DESIGN.md

Flamehaven Initiative

release: v1.8.0 mica runtime uplift

d647970 1 day ago

preview code

raw

history blame contribute delete

33.7 kB

A newer version of the Gradio SDK is available: 6.15.2

Upgrade

STEM BIO-AI Calibration Profile Architecture

Version: 1.8.0 Status: implemented mirror-only calibration contract with derive/simulate preview surfaces; 1.8.0 preview hardening complete; authoritative read-through remains future work

1. Current State

STEM BIO-AI already separates formal scoring, deterministic diagnostics, regulatory traceability, and AI advisory into distinct lanes.

As of 1.8.0, the repository ships a real calibration architecture:

packaged profiles in policy/
schema and runtime validation
result metadata surfacing for active profile identity
CLI policy visibility via stem policy list, stem policy explain, and --policy <name>
researcher-intent preview surfaces via stem policy derive and stem policy simulate

What is still not fully separated is the authoritative score read-through surface:

stage weights
tier boundaries
clinical caps and hard floors
evidence-only versus score-authoritative detector status
reasoning-model status labels

In 1.8.0, most score-affecting values are still implemented as runtime constants plus prose in SCORING_RATIONALE.md, even though mirror-only profile metadata, CLI-visible profile selection, and derive/simulate preview surfaces are already live. That is acceptable for the current release line, but it still creates a long-term maintenance risk:

if calibration values are easy to change but hard to govern, the architecture will drift even if the lane boundaries remain conceptually correct.

This document describes the implemented versioned calibration profile architecture and the remaining governed path to authoritative read-through.

2. Problem Statement

As advisory systems become stronger, teams usually feel pressure to:

raise or lower tier thresholds
relax or tighten clinical caps
promote diagnostics from evidence-only into score-bearing logic
add new penalties or soften old ones
reinterpret reasoning or advisory outputs as scoring evidence

If those changes happen ad hoc in code, three problems appear:

the formal score becomes harder to reproduce across versions
policy drift hides inside implementation edits
advisory or diagnostic signals may slowly leak into the formal score without an explicit governance decision

The issue is not whether calibration should ever change.

The issue is whether calibration changes are:

versioned
reviewable
explainable in artifacts
bounded by explicit promotion rules

3. Design Goal

The goal is not to let users arbitrarily tune the score from the CLI.

The goal is to make calibration:

easy to inspect
easy to version
easy to compare between releases
hard to mutate accidentally

In short:

STEM BIO-AI needs easy maintenance, not easy drift.

4. Core Principle

Calibration should become a policy object, not a scattered implementation detail.

That means:

the active scoring profile is represented in a versioned file
result artifacts record which profile was used
policy changes become visible release events
score-affecting changes require explicit promotion criteria

This preserves the current architectural discipline:

formal score remains deterministic
diagnostics can stay evidence-only until promoted
advisory remains structurally subordinate to the score
regulatory mapping remains traceability support, not a score multiplier

5. Implemented Shape

Current packaged profile files:

policy/scoring_profile.default.v1.json policy/scoring_profile.strict_clinical_adjacency.v1.json

Deferred profiles:

reproducibility_first
documentation_lenient
research_repo_baseline
biosecurity_cautious

Important restriction:

normal users should select from named profiles
normal users should not pass arbitrary weights or tier cutoffs on the command line

Good:

stem scan <repo> --policy default
stem scan <repo> --policy benchmark-candidate

Bad:

stem scan <repo> --stage1-weight 0.35 --t3-threshold 68 --cap 72

The first preserves governance.
The second turns the tool into an untracked tuning console.

6. Current Profile Contract

Current shipped fields:

{
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "tool_version_introduced": "1.6.5",
  "tool_version_last_validated": "1.8.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "weights": {
    "stage_1_percent": 40,
    "stage_2r_percent": 20,
    "stage_3_percent": 40
  },
  "stage_baselines": {
    "stage_1": 60,
    "stage_2r": 60,
    "stage_3": 0
  },
  "tier_policy": {
    "tier_names": ["T0", "T1", "T2", "T3", "T4"],
    "tier_boundaries": [40, 55, 70, 85],
    "boundary_semantics": "left_closed_right_open",
    "score_domain": "integer_0_to_100"
  },
  "clinical_policy": {
    "ca_no_disclaimer_cap": 69,
    "t0_hard_floor_cap": 39
  },
  "code_integrity_policy": {
    "C1_penalty": 10,
    "C2_score_affecting": false,
    "C3_score_affecting": false,
    "C4_score_affecting": false
  },
  "stage_3_policy": {
    "normalization": {
      "kind": "linear_round",
      "raw_max": 80,
      "target_max": 100,
      "rounding": "half_up_int"
    }
  },
  "diagnostic_policy": {
    "BIO_smiles_surface_integrity": "evidence_only",
    "BIO_smiles_rdkit_validation": "evidence_only",
    "BIO_smiles_parser_guard": "evidence_only",
    "BIO_silent_mock_fallback": "evidence_only",
    "BIO_traceability_manifest_surface": "evidence_only",
    "BIO_subprocess_run_trace": "evidence_only"
  },
  "reasoning_policy": {
    "status": "diagnostic_only_uncalibrated_initial_prior",
    "score_integration": "forbidden"
  },
  "governance_sources": {
    "ca_taxonomy_version": "ca-taxonomy-v1",
    "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
  }
}

This is the active shipped schema family in 1.8.0.

Schema notes:

weights should be stored as integer percentages, not floating-point fractions
tier boundaries should be stored once as a single ordered array
normalization should be represented as named semantics plus parameters, not a free-form expression string
policy_version should be independent from the tool release version
profile_read_mode must distinguish mirror-only exposure from authoritative runtime loading
stage_3_policy.b2_partial_credit_mode is currently a declared mirror-only profile field; authoritative Stage 3 B2 scoring in 1.8.0 still follows the hardcoded scanner path and does not yet read this value directly
governance_sources.ca_taxonomy_version must increment whenever runtime CA trigger membership, severity mapping, or cap-relevant phrase semantics change

Current profile_status state set:

preview_only
experimental
benchmark_candidate
authoritative_release
deprecated

Current status transition path:

preview_only -> experimental -> benchmark_candidate -> authoritative_release -> deprecated

Other transitions should require an explicit migration note.

7. Artifact Requirements

Every result object should record:

policy_schema_version
policy_version
profile_name
profile_status
profile_read_mode
policy_sha256

Why:

two runs are not meaningfully comparable unless they share the same active profile
policy drift should be visible in the artifact itself
benchmark comparisons should be able to say whether differences came from repository evidence or policy revision
mirror-only and authoritative-read runs must not look equivalent in artifacts

policy_sha256 must be defined precisely.

Recommended definition:

canonicalize the policy JSON using sorted keys and UTF-8 encoding
exclude the policy_sha256 field itself from the hash input
hash the canonicalized policy file bytes only

In mirror_only mode, the profile file may leave policy_sha256 as null.

The runtime artifact should still surface the computed canonical hash so profile comparisons remain stable during Phase 1.

This hash does not claim to represent every runtime governance source.

Instead, runtime governance dependencies such as the CA taxonomy should be surfaced separately under governance_sources.

Recommended JSON example:

"calibration_profile": {
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "policy_sha256": "..."
},
"governance_sources": {
  "ca_taxonomy_version": "ca-taxonomy-v1",
  "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
}

8. Diagnostics Graduation Policy

The hardest maintenance problem is not weight tuning.

It is detector promotion:

when does an evidence-only detector become score-authoritative?

Recommended detector states:

evidence_only
candidate_scored
scored
deprecated

Recommended transition rules:

From	To	Allowed?	Notes
`evidence_only`	`candidate_scored`	yes	requires promotion gate below
`candidate_scored`	`scored`	yes	requires promotion gate below
`candidate_scored`	`evidence_only`	yes	allowed if benchmark review regresses confidence
`scored`	`candidate_scored`	yes	allowed for rollback after release observation
`scored`	`deprecated`	yes	allowed when detector is retired
`evidence_only`	`deprecated`	yes	allowed when detector is abandoned
`deprecated`	any active state	no by default	require explicit redesign note

Recommended promotion gate before moving from evidence_only to candidate_scored:

commit-pinned benchmark fixtures exist for at least N >= 20 repositories
detector output is reproducible across 3 consecutive identical runs
false-positive review has been documented with observed false_positive_rate <= 0.05
a release note explains what changed
SCORING_RATIONALE.md is updated if the detector affects score logic

Recommended promotion gate before moving from candidate_scored to scored:

benchmark evidence shows the detector improves review precision on the maintained fixture set
at least one release cycle of observation has occurred
the profile change is versioned as a policy revision

This is the governance mechanism that prevents “AI got more capable, so we quietly started scoring with it.”

9. Advisory Boundary Rule

The calibration profile should explicitly state that advisory output cannot rewrite the formal score unless a future architecture intentionally changes that rule.

Recommended field:

"reasoning_policy": {
  "status": "diagnostic_only_uncalibrated_initial_prior",
  "score_integration": "forbidden"
}

That matters because boundary failures often begin as convenience:

a provider looks helpful
the advisory output seems more nuanced
a team wants to “just incorporate it a little”

Once that happens without a versioned policy change, the formal score stops being what the architecture claims it is.

10. CLI Policy Surface

Recommended CLI behavior:

--policy default
--policy <named_profile>
--list-policies

Not recommended:

direct numeric overrides for weights, thresholds, caps, or detector promotion state

Developer-only experimental override support is acceptable if all of the following are true:

it is clearly marked non-authoritative
it writes a different profile_status
output artifacts visibly say experimental policy was used
it is excluded from default examples and documentation

11. Researcher UX and Participation Model

The most important UX constraint is this:

researchers should be able to influence policy intent without turning the CLI into a free-form scoring console.

That means STEM BIO-AI should prefer:

named profile templates
guided questions
side-by-side score diffs
explicit promotion to shared policy

over:

raw numeric knobs
hidden threshold editing
untracked one-off scoring profiles

11.1 Starting Point: Profile Templates

The first interaction should not be:

"enter your own weights and caps"

It should be:

"which evaluation posture best matches your repository context?"

Current active named profiles:

default
strict_clinical_adjacency

Deferred named profiles:

reproducibility_first
research_repo_baseline
documentation_lenient
biosecurity_cautious

These names are easier for researchers to reason about than raw numbers.

11.2 Guided Policy Builder

After a template is selected, the next layer should be a guided builder rather than free-form editing.

Examples of acceptable questions:

"Should code-integrity evidence outweigh README surface evidence?"
"Should clinical-adjacent claims trigger stricter caps?"
"Should bias/limitations require structured sections rather than term presence?"
"Should replication evidence matter more for your workflow?"

The user answers policy questions.

The system translates them into profile deltas.

This preserves usability while keeping the policy surface inspectable.

11.2.1 Researcher Intent Scale

Before users touch any named policy, STEM BIO-AI should reduce the interpretation gap between:

what the researcher actually cares about
what the default profile currently emphasizes

The implemented mechanism is a researcher intent layer built around short 1–5 scales.

Important boundary:

the 1–5 scale is a UX input surface, not part of the formal score engine.

In other words:

users do not set formal weights directly
users do not set tier thresholds directly
users do not generate a score by summing their answers

Instead, the scale helps the system infer which existing policy posture is closer to the researcher's intent.

Recommended scale interpretation:

1 = minimal emphasis
2 = light emphasis
3 = moderate emphasis
4 = strong emphasis
5 = very strong emphasis

Current question areas:

how strict clinical-adjacent claims should be treated
whether code-integrity evidence should outweigh README/documentation evidence
how much reproducibility evidence should matter
whether structured limitations should be required before partial credit is awarded

This approach borrows the usability advantage of Likert-style scales without turning the scanner into a free-form tuning instrument.

11.2.2 Why the Scale Belongs in UX, Not Scoring

Researchers can usually answer:

"clinical-adjacent claims should be treated very strictly"

more reliably than:

"set the CA cap delta to -12 and reduce the Stage 1 weight by 0.05"

That is why the scale should live in the interview layer.

The formal engine should still consume:

named profiles
explicit policy objects
versioned calibration state

The scale is only a translation surface between human intent and governed policy.

11.2.3 Translation Rule

The system should map researcher answers to one of three outcomes:

recommend an existing named profile
show a preview-only profile delta
show that the default profile already matches the stated posture under an explicit rule

This is safer than letting the user edit raw scoring parameters directly.

The current implementation uses an auditable rule table instead of a hidden similarity function.

Current intent variables:

clinical_strictness
code_integrity_priority
reproducibility_priority
structured_limitations_requirement

Current 1.8.0 decision rules:

Condition	Outcome
`clinical_strictness >= 4` and `reproducibility_priority <= 3`	recommend `strict_clinical_adjacency`
all four values are `2` or `3`	keep `default`
no named profile rule matches	generate `preview_only` profile delta from explicit bounded deltas only

This narrow table is intentional. It keeps the translation layer visible, reviewable, and testable without pretending that every strong posture already has a release-grade named profile. In particular, reproducibility_first remains deferred in 1.8.0; high reproducibility answers still fall back to preview_only Stage 4 emphasis rather than a named recommendation.

Rule priority:

evaluate rules top-down
stop at the first named-profile match
if multiple strong postures are simultaneously requested and no single named profile dominates, fall back to preview_only

Example:

clinical_strictness = 4
reproducibility_priority = 4

This should fall back to preview_only in the initial implementation rather than pretending one named posture has clear priority.

Lower-bound meaning:

1 means minimal emphasis
1 does not remove or disable an axis
therefore the minimum scale value still participates in threshold checks such as <= 2

Current "default already matches" rule:

if the selected baseline is default
and all four intent variables are in the 2..3 range
and no explicit named-profile rule is triggered

then the system should report that the default profile already matches the stated posture closely enough to avoid a custom preview.

Current preview_only boundary:

do not compute nearest-profile distance
do not infer hidden similarity scores
do not mutate arbitrary raw numbers

Instead:

start from the selected baseline profile
apply explicit bounded deltas associated with the triggered answers
mark the result as preview_only

This keeps the intent layer auditable during the first implementation cycle.

Current bounded deltas used in preview-only mode:

Triggered answer	Allowed `preview_only` delta shape
`clinical_strictness >= 4` with no named-profile match	switch to stricter CA posture only; do not change unrelated weights
`reproducibility_priority >= 4` with no named-profile match	raise Stage 4 emphasis only within predeclared policy bounds
`structured_limitations_requirement >= 4` with no named-profile match	require stricter Stage 3 B2 partial-credit posture only
multiple strong answers with no named-profile match	combine only explicitly listed bounded deltas; do not infer new arithmetic outside documented policy fields

These are active preview-only deltas in 1.8.0. They are not hidden similarity operations and they do not mutate the authoritative scan path.

11.2.4 Comparison Output

This subsection describes the immediate output of the intent-scale flow.

If the scale is used, the system should immediately show:

chosen baseline profile
recommended profile or preview delta
score difference on the current repository
tier difference on the current repository
which policy dimensions changed

The key UX question is not:

"what settings changed?"

It is:

"what did the repository outcome change to, and why?"

11.2.5 Current Named Profile Definitions

The current implementation defines two named profiles:

default
strict_clinical_adjacency

Deferred until explicitly defined:

documentation_lenient
- not active in the 1.8.0 rule table
research_repo_baseline
- not active in the 1.8.0 rule table
biosecurity_cautious
- not active in the 1.8.0 rule table
reproducibility_first
- intentionally deferred until an actual policy diff exists and a release-grade recommendation path is defined

Documented diff fields per active profile:

stage weights
clinical cap / hard-floor posture
Stage 3 B2 strictness posture
Stage 4 emphasis posture

Current starter diff:

Profile	Stage weights	Clinical posture	B2 posture	Stage 4 posture
`default`	`0.40 / 0.20 / 0.40`	standard CA cap / hard-floor rules	structured boundary language for partial credit	current baseline
`strict_clinical_adjacency`	`0.40 / 0.20 / 0.40`	tighter `ca_no_disclaimer_cap=60`, tighter `t0_hard_floor_cap=35`	same as `default`	`baseline`

Any additional named profile must document its concrete diff here before it becomes eligible for CLI recommendation.

11.2.6 Next UX Step: `simulate --profile-file`

The next reasonable researcher-facing extension is a simulation-only profile file input.

Recommended shape:

stem policy simulate <repo> --profile-file my_profile.json

This should be allowed because it improves domain experimentation without weakening the authoritative scan contract.

Intended behavior:

load a local profile file through the same schema and runtime validation path
treat the file as simulation-only input
do not register the file as an installed named profile automatically
do not let scan --policy or gate --policy consume it on the authoritative path

Required guardrails:

schema-valid before simulation starts
profile_read_mode must remain mirror_only or preview_only for external profile-file simulation
artifact output must clearly mark the file as local simulation input, not a packaged release profile
profile hash should still be surfaced so simulation outputs remain comparable

Recommended artifact labels for this future path:

profile_name = external_profile_file
profile_status = preview_only
profile_source = local_file
policy_sha256 = <computed canonical hash>

The goal is to let researchers try domain-specific posture proposals without creating a backdoor around governed named-profile promotion.

11.3 Side-by-Side Simulation

This subsection describes the more general simulation surface, which may also be used outside the intent-scale interview.

The most important feedback loop is not the profile editor itself.

It is the comparison view.

Researchers should be able to see:

default profile result
custom profile result
score delta
tier delta
cap / hard-floor delta
which evidence lanes changed the outcome

The right question is not:

"what numbers did I change?"

It is:

"what review outcome changed, and why?"

11.4 Roles

The intended governance model is not "every researcher defines the official score alone."

Recommended roles:

researcher
- explains domain priorities
- surfaces false positives / false negatives
- evaluates whether the default policy fits the repository context
policy steward
- maintains release-grade named profiles
- reviews score-affecting changes
- prevents silent drift between personal and team policy
tool
- computes policy diff
- computes result diff
- records profile metadata in artifacts

This division keeps domain experts involved without sacrificing reproducibility.

11.4.1 Researcher Participation Rules

The participation model should stay simple:

researchers may propose posture changes
researchers may run derive and simulate
researchers may edit personal or branch-local preview profiles for comparison work
researchers should not directly redefine the release-grade default policy on the authoritative score path

The key distinction is:

preview_only and experimental are valid spaces for domain input
authoritative_release is a governed release artifact

This means a researcher can legitimately say:

"for this domain, I want stricter clinical-adjacent treatment and stronger reproducibility emphasis"

but should not unilaterally convert that statement into:

new official tier boundaries
new default caps
new score-bearing detector promotion
new release-grade policy semantics

The system should therefore optimize for:

easy posture expression
easy repository-specific simulation
hard-to-mutate official policy

11.4.2 Operating Principles

Operationally, the collaboration rule is:

the researcher expresses domain priorities
the tool translates those priorities into a visible named-profile recommendation or preview_only delta
the policy steward decides whether that posture remains local, becomes experimental, or is promoted into a release-grade policy artifact

In practice:

a researcher should be able to explore calibration without editing scanner code
a steward should be able to reject score-affecting drift even when the local preview is reasonable
artifacts should clearly distinguish personal preview from official release policy

The intended output is not "personalized truth."

The intended output is:

a stable official score policy
a transparent preview lane for domain-specific posture testing
an explicit governance path between the two

11.4.3 Responsibility Matrix

Action	Researcher	Policy steward	Tool
express domain posture	primary	optional review	guided input surface
run `stem policy derive`	primary	optional review	translates intent
run `stem policy simulate`	primary	optional review	computes baseline vs preview
edit `preview_only` deltas for local exploration	allowed	review optional	validates bounded deltas
create or modify `experimental` named profiles	propose	approve / reject	validates profile schema and metadata
promote a profile to `benchmark_candidate`	propose evidence	required approval	records status transition
promote a profile to `authoritative_release`	provide domain rationale	required approval	requires parity / benchmark metadata
change default release policy semantics	no unilateral authority	required owner	records artifact provenance
change score-bearing detector status	no unilateral authority	required owner	enforces transition metadata

11.5 Promotion Path

Recommended progression:

personal preview profile
side-by-side run against real repository output
review and comparison against default profile
promotion to named team policy if approved

That promotion should update:

profile_name
profile_status
policy_sha256
changelog / rationale references when score logic changes

11.5.1 Promotion Gates

The progression above should not be symbolic only.

Each transition should have an explicit gate:

From	To	Minimum gate
`preview_only`	`experimental`	profile file exists, schema-valid, bounded diff documented, repository-side simulation reviewed
`experimental`	`benchmark_candidate`	compared against default on named fixtures or benchmark repos, intended score deltas explained, no hidden arithmetic
`benchmark_candidate`	`authoritative_release`	parity or benchmark note completed, rationale updated, changelog entry prepared, steward approval recorded
`authoritative_release`	`deprecated`	replacement or retirement note recorded, artifact comparability preserved

The intent is to prevent one common failure mode:

a locally useful domain tweak quietly becoming the new official score policy without an explicit release decision

11.5.2 What Researchers Can Change Directly

Researchers should be allowed to directly change:

intent-scale answers
selected baseline profile for simulation
branch-local preview_only deltas inside documented bounds
explanatory notes attached to why a preview better matches the domain

Researchers should not directly change, on the authoritative path:

default release profile semantics
tier boundaries for the official score
detector graduation state
score-bearing penalty activation rules
release-grade policy status labels

If a domain team wants one of those changes, the correct path is:

simulate locally
capture the proposed diff
compare against default on real repositories
propose promotion through the governed profile path

11.6 Interface Direction

Recommended near-term CLI additions:

stem policy list
stem policy explain <name>
stem scan <repo> --policy <name>

Recommended later UX additions:

wizard-style policy derivation
side-by-side simulation view
profile diff explanation panel
simulate --profile-file <path> for governed local experimentation without named-profile promotion

The guiding rule is:

researchers should tune posture through explicit choices, not hidden arithmetic.

12. Implemented Milestones and Remaining Roadmap

1.6.5: mirror-only profile contract

create the profile file
keep scanner behavior unchanged
expose profile metadata in JSON output
verify the profile matches current release behavior exactly using a differential fixture set with gold outputs

Recommended fixture format:

{
  "fixture_name": "default_profile_parity_small_repo",
  "target_repo": "tests/fixtures/repos/small_repo_a",
  "expected": {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": null
  }
}

Recommended fixture location:

tests/fixtures/calibration_profiles/

Phase 1 parity target fields:

raw_score_before_floor
final_score
formal_tier
classification.score_cap

1.6.6: policy visibility

added stem policy list and stem policy explain
added --policy <name> on scan/gate/advisory workflows
surfaced selected profile metadata in stdout, Markdown, explain text, and PDF headers

1.6.7: derive/simulate preview

added stem policy derive for auditable intent translation, now standardized on the governed 1–5 posture scale
added stem policy simulate <repo> for baseline-vs-preview outcome comparison
kept derive/simulate outputs mirror-only so authoritative scoring remains unchanged

1.6.8: preview hardening and citation readiness

kept mirror-only scan scoring unchanged while hardening preview simulation against future profile drift
aligned simulate with profile-aware C1 penalty behavior instead of assuming scanner constants forever
revalidated preview_only profiles after bounded deltas are applied
strengthened mirror-only wording across CLI and report surfaces so scan --policy is not confused with policy simulate
added CITATION.cff and .zenodo.json so release artifacts are ready for DOI-backed citation once GitHub releases are archived by Zenodo

Remaining roadmap

authoritative read-through of policy weights/caps/thresholds
additional release-grade named profiles beyond strict_clinical_adjacency
explicit read-through for currently declared mirror-only fields such as stage_3_policy.b2_partial_credit_mode
ca-taxonomy-vN governance policy so runtime trigger-set changes are versioned as first-class release events
Phase 2 target release remains intentionally unset until parity fixtures, differential tests, and rollback notes are ready for the first score-authoritative read-through patch
future score-affecting policy changes require:
- profile update
- rationale update
- changelog entry
- benchmark note when relevant

This keeps calibration governance ahead of personalization while avoiding a risky “big bang” rewrite.

13. Non-Goals

This architecture does not aim to:

let users personalize trust scores
introduce hidden model-based calibration
turn regulatory mapping into a numerical score multiplier
make advisory output part of the formal score
replace benchmark or manual review with profile editing

It only aims to make calibration easier to maintain without weakening lane boundaries.

15. Draft: `reproducibility_first`

reproducibility_first is still deferred as an active named recommendation, but a draft posture is reasonable now.

Draft intent:

researchers want reproducibility evidence to matter more in review posture
they do not necessarily want stricter clinical caps
they are usually asking for stronger replication scrutiny, not a different claim-risk philosophy

Draft profile shape:

Profile	Stage weights	Clinical posture	B2 posture	Stage 4 posture
`reproducibility_first` (draft)	`0.40 / 0.20 / 0.40`	same as `default`	same as `default`	`stronger_than_baseline`

Recommended initial metadata:

profile_status = experimental
profile_read_mode = mirror_only
policy_version = ca-policy-1.0-repro-first

Important limitation:

In the current engine, Stage 4 remains a separate replication lane and does not alter score.final_score.

That means reproducibility_first is currently useful for:

simulation posture comparison
replication-lane emphasis
future promotion groundwork

but not yet for:

score-authoritative final-score change on the formal scan path

So the draft should remain deferred until one of the following is true:

there is a release-grade rationale for how stronger replication posture should affect official review outcomes
the simulation/report surface exposes a meaningful replication-posture difference without pretending it changed the formal score
a future Phase 2 read-through explicitly defines what parts of Stage 4 posture become authoritative and under what governance rule

Until then, high reproducibility intent should continue to:

fall back to preview_only
raise Stage 4 emphasis in simulation
avoid pretending that an official release-grade named profile already exists

14. Final Position

STEM BIO-AI now has a real calibration architecture.

The correct mechanism is:

a versioned calibration profile with explicit promotion rules

not:

ad hoc runtime tuning knobs

If the system is serious about preserving the distinction between:

formal score
diagnostics
regulatory traceability
advisory

then calibration must be governed with the same discipline.

The right outcome is not “more adjustable.”

The right outcome is:

more maintainable, without becoming easier to drift.