STEM BIO-AI Calibration Profile Architecture
Version: 1.8.0
Status: implemented mirror-only calibration contract with derive/simulate preview surfaces; 1.8.0 preview hardening complete; authoritative read-through remains future work
1. Current State
STEM BIO-AI already separates formal scoring, deterministic diagnostics, regulatory traceability, and AI advisory into distinct lanes.
As of 1.8.0, the repository ships a real calibration architecture:
- packaged profiles in `policy/`
- schema and runtime validation
- result metadata surfacing for active profile identity
- CLI policy visibility via `stem policy list`, `stem policy explain`, and `--policy <name>`
- researcher-intent preview surfaces via `stem policy derive` and `stem policy simulate`
What is still not fully separated is the authoritative score read-through surface:
- stage weights
- tier boundaries
- clinical caps and hard floors
- evidence-only versus score-authoritative detector status
- reasoning-model status labels
In 1.8.0, most score-affecting values are still implemented as runtime constants plus prose in SCORING_RATIONALE.md, even though mirror-only profile metadata, CLI-visible profile selection, and derive/simulate preview surfaces are already live. That is acceptable for the current release line, but it still creates a long-term maintenance risk:
if calibration values are easy to change but hard to govern, the architecture will drift even if the lane boundaries remain conceptually correct.
This document describes the implemented versioned calibration profile architecture and the remaining governed path to authoritative read-through.
2. Problem Statement
As advisory systems become stronger, teams usually feel pressure to:
- raise or lower tier thresholds
- relax or tighten clinical caps
- promote diagnostics from evidence-only into score-bearing logic
- add new penalties or soften old ones
- reinterpret reasoning or advisory outputs as scoring evidence
If those changes happen ad hoc in code, three problems appear:
- the formal score becomes harder to reproduce across versions
- policy drift hides inside implementation edits
- advisory or diagnostic signals may slowly leak into the formal score without an explicit governance decision
The issue is not whether calibration should ever change.
The issue is whether calibration changes are:
- versioned
- reviewable
- explainable in artifacts
- bounded by explicit promotion rules
3. Design Goal
The goal is not to let users arbitrarily tune the score from the CLI.
The goal is to make calibration:
- easy to inspect
- easy to version
- easy to compare between releases
- hard to mutate accidentally
In short:
STEM BIO-AI needs easy maintenance, not easy drift.
4. Core Principle
Calibration should become a policy object, not a scattered implementation detail.
That means:
- the active scoring profile is represented in a versioned file
- result artifacts record which profile was used
- policy changes become visible release events
- score-affecting changes require explicit promotion criteria
This preserves the current architectural discipline:
- formal score remains deterministic
- diagnostics can stay evidence-only until promoted
- advisory remains structurally subordinate to the score
- regulatory mapping remains traceability support, not a score multiplier
5. Implemented Shape
Current packaged profile files:
policy/scoring_profile.default.v1.json
policy/scoring_profile.strict_clinical_adjacency.v1.json
Deferred profiles:
- `reproducibility_first`
- `documentation_lenient`
- `research_repo_baseline`
- `biosecurity_cautious`
Important restriction:
- normal users should select from named profiles
- normal users should not pass arbitrary weights or tier cutoffs on the command line
Good:
stem scan <repo> --policy default
stem scan <repo> --policy strict_clinical_adjacency
Bad:
stem scan <repo> --stage1-weight 0.35 --t3-threshold 68 --cap 72
The first preserves governance.
The second turns the tool into an untracked tuning console.
6. Current Profile Contract
Current shipped fields:
{
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "tool_version_introduced": "1.6.5",
  "tool_version_last_validated": "1.8.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "weights": {
    "stage_1_percent": 40,
    "stage_2r_percent": 20,
    "stage_3_percent": 40
  },
  "stage_baselines": {
    "stage_1": 60,
    "stage_2r": 60,
    "stage_3": 0
  },
  "tier_policy": {
    "tier_names": ["T0", "T1", "T2", "T3", "T4"],
    "tier_boundaries": [40, 55, 70, 85],
    "boundary_semantics": "left_closed_right_open",
    "score_domain": "integer_0_to_100"
  },
  "clinical_policy": {
    "ca_no_disclaimer_cap": 69,
    "t0_hard_floor_cap": 39
  },
  "code_integrity_policy": {
    "C1_penalty": 10,
    "C2_score_affecting": false,
    "C3_score_affecting": false,
    "C4_score_affecting": false
  },
  "stage_3_policy": {
    "normalization": {
      "kind": "linear_round",
      "raw_max": 80,
      "target_max": 100,
      "rounding": "half_up_int"
    }
  },
  "diagnostic_policy": {
    "BIO_smiles_surface_integrity": "evidence_only",
    "BIO_smiles_rdkit_validation": "evidence_only",
    "BIO_smiles_parser_guard": "evidence_only",
    "BIO_silent_mock_fallback": "evidence_only",
    "BIO_traceability_manifest_surface": "evidence_only",
    "BIO_subprocess_run_trace": "evidence_only"
  },
  "reasoning_policy": {
    "status": "diagnostic_only_uncalibrated_initial_prior",
    "score_integration": "forbidden"
  },
  "governance_sources": {
    "ca_taxonomy_version": "ca-taxonomy-v1",
    "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
  }
}
This is the active shipped schema family in 1.8.0.
Schema notes:
- weights should be stored as integer percentages, not floating-point fractions
- tier boundaries should be stored once as a single ordered array
- normalization should be represented as named semantics plus parameters, not a free-form expression string
- `policy_version` should be independent from the tool release version
- `profile_read_mode` must distinguish mirror-only exposure from authoritative runtime loading
- `stage_3_policy.b2_partial_credit_mode` is currently a declared mirror-only profile field; authoritative Stage 3 B2 scoring in `1.8.0` still follows the hardcoded scanner path and does not yet read this value directly
- `governance_sources.ca_taxonomy_version` must increment whenever runtime CA trigger membership, severity mapping, or cap-relevant phrase semantics change
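The `tier_policy` and `stage_3_policy.normalization` fields can be read as follows. This is a minimal sketch, not the scanner's code: the function names `tier_for_score` and `normalize_stage_3` are hypothetical, and it assumes that `left_closed_right_open` means a boundary value belongs to the higher tier and that `half_up_int` rounds exact halves upward.

```python
from bisect import bisect_right

# Values copied from the shipped default profile shown above.
TIER_NAMES = ["T0", "T1", "T2", "T3", "T4"]
TIER_BOUNDARIES = [40, 55, 70, 85]


def tier_for_score(score: int) -> str:
    """Map an integer 0..100 score to a tier name.

    Assumption: intervals are [0, 40), [40, 55), [55, 70), [70, 85), [85, 100].
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score outside integer_0_to_100 domain: {score}")
    # bisect_right counts how many boundaries are <= score,
    # which is exactly the tier index for left-closed intervals.
    return TIER_NAMES[bisect_right(TIER_BOUNDARIES, score)]


def normalize_stage_3(raw: int, raw_max: int = 80, target_max: int = 100) -> int:
    """Linear rescale of a raw Stage 3 value with half-up integer rounding."""
    if not 0 <= raw <= raw_max:
        raise ValueError(f"raw value outside 0..{raw_max}: {raw}")
    # Integer form of floor(raw * target_max / raw_max + 1/2); avoids float error.
    return (2 * raw * target_max + raw_max) // (2 * raw_max)
```

Under these assumptions the default caps line up with the boundaries: a score capped at `69` stays in `T2`, and a score capped at `39` stays in `T0`.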
Current profile_status state set:
- `preview_only`
- `experimental`
- `benchmark_candidate`
- `authoritative_release`
- `deprecated`
Current status transition path:
preview_only -> experimental -> benchmark_candidate -> authoritative_release -> deprecated
Other transitions should require an explicit migration note.
7. Artifact Requirements
Every result object should record:
- `policy_schema_version`
- `policy_version`
- `profile_name`
- `profile_status`
- `profile_read_mode`
- `policy_sha256`
Why:
- two runs are not meaningfully comparable unless they share the same active profile
- policy drift should be visible in the artifact itself
- benchmark comparisons should be able to say whether differences came from repository evidence or policy revision
- mirror-only and authoritative-read runs must not look equivalent in artifacts
policy_sha256 must be defined precisely.
Recommended definition:
- canonicalize the policy JSON using sorted keys and UTF-8 encoding
- exclude the `policy_sha256` field itself from the hash input
- hash the canonicalized policy file bytes only
In mirror_only mode, the profile file may leave policy_sha256 as null.
The runtime artifact should still surface the computed canonical hash so profile comparisons remain stable during Phase 1.
This hash does not claim to represent every runtime governance source.
Instead, runtime governance dependencies such as the CA taxonomy should be surfaced separately under governance_sources.
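The recommended hash definition can be sketched in a few lines of Python. The function name `canonical_policy_sha256` is hypothetical, and two details are assumptions that the definition above does not fix: compact separators are used for canonicalization, and only the top-level `policy_sha256` field is excluded.

```python
import hashlib
import json


def canonical_policy_sha256(policy: dict) -> str:
    """Hash a policy object independent of key order and of its own hash field."""
    # Exclude the hash field itself so the value can be stored in the file
    # (or left as null in mirror_only mode) without changing the digest.
    hashable = {key: value for key, value in policy.items() if key != "policy_sha256"}
    # Assumption: sorted keys plus compact separators define the canonical form.
    canonical = json.dumps(
        hashable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, a profile file that stores `null` for `policy_sha256` and one that stores the computed digest produce the same hash, which is what keeps mirror-only artifacts comparable.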
Recommended JSON example:
"calibration_profile": {
"policy_schema_version": "1",
"policy_version": "ca-policy-1.0",
"profile_name": "default",
"profile_status": "authoritative_release",
"profile_read_mode": "mirror_only",
"policy_sha256": "..."
},
"governance_sources": {
"ca_taxonomy_version": "ca-taxonomy-v1",
"ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
}
8. Diagnostics Graduation Policy
The hardest maintenance problem is not weight tuning.
It is detector promotion:
when does an evidence-only detector become score-authoritative?
Recommended detector states:
- `evidence_only`
- `candidate_scored`
- `scored`
- `deprecated`
Recommended transition rules:
| From | To | Allowed? | Notes |
|---|---|---|---|
| `evidence_only` | `candidate_scored` | yes | requires promotion gate below |
| `candidate_scored` | `scored` | yes | requires promotion gate below |
| `candidate_scored` | `evidence_only` | yes | allowed if benchmark review regresses confidence |
| `scored` | `candidate_scored` | yes | allowed for rollback after release observation |
| `scored` | `deprecated` | yes | allowed when detector is retired |
| `evidence_only` | `deprecated` | yes | allowed when detector is abandoned |
| `deprecated` | any active state | no by default | require explicit redesign note |
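The transition table can be encoded as data so that a profile validator rejects anything it does not list. This is a sketch under stated assumptions: `detector_transition_allowed` and its `redesign_note` parameter are hypothetical names, and transitions absent from the table (for example `candidate_scored` to `deprecated`) are treated as disallowed.

```python
from typing import Optional

# Allowed (from, to) pairs, copied row by row from the table above.
ALLOWED_DETECTOR_TRANSITIONS = {
    ("evidence_only", "candidate_scored"),
    ("candidate_scored", "scored"),
    ("candidate_scored", "evidence_only"),
    ("scored", "candidate_scored"),
    ("scored", "deprecated"),
    ("evidence_only", "deprecated"),
}


def detector_transition_allowed(
    old: str, new: str, redesign_note: Optional[str] = None
) -> bool:
    """Return True if a detector may move from `old` to `new`."""
    if old == "deprecated":
        # Reviving a deprecated detector is refused by default and needs
        # an explicit, non-empty redesign note.
        return bool(redesign_note and redesign_note.strip())
    # Assumption: any pair not listed in the table is rejected.
    return (old, new) in ALLOWED_DETECTOR_TRANSITIONS
```

Note that `evidence_only` to `scored` is rejected here, which matches the intent that a detector cannot skip the `candidate_scored` observation state.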
Recommended promotion gate before moving from evidence_only to candidate_scored:
- commit-pinned benchmark fixtures exist for at least `N >= 20` repositories
- detector output is reproducible across `3` consecutive identical runs
- false-positive review has been documented with observed `false_positive_rate <= 0.05`
- a release note explains what changed
- `SCORING_RATIONALE.md` is updated if the detector affects score logic
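The `evidence_only` to `candidate_scored` gate lends itself to a mechanical check that reports which conditions are still unmet. The `PromotionEvidence` record and its field names below are hypothetical; only the thresholds (`20`, `3`, `0.05`) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionEvidence:
    """Hypothetical evidence record collected during a promotion review."""
    pinned_fixture_repos: int
    identical_consecutive_runs: int
    false_positive_rate: float
    release_note_written: bool
    affects_score_logic: bool
    rationale_updated: bool


def unmet_gate_conditions(evidence: PromotionEvidence) -> list:
    """Return the gate conditions that block promotion (empty list = pass)."""
    unmet = []
    if evidence.pinned_fixture_repos < 20:
        unmet.append("fewer than 20 commit-pinned fixture repositories")
    if evidence.identical_consecutive_runs < 3:
        unmet.append("not reproduced across 3 consecutive identical runs")
    if evidence.false_positive_rate > 0.05:
        unmet.append("false_positive_rate above 0.05")
    if not evidence.release_note_written:
        unmet.append("release note missing")
    # The rationale update is only required when score logic is affected.
    if evidence.affects_score_logic and not evidence.rationale_updated:
        unmet.append("SCORING_RATIONALE.md not updated")
    return unmet
```

Returning the list of failures rather than a bare boolean keeps the review output explainable, in line with the artifact requirements.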
Recommended promotion gate before moving from candidate_scored to scored:
- benchmark evidence shows the detector improves review precision on the maintained fixture set
- at least one release cycle of observation has occurred
- the profile change is versioned as a policy revision
This is the governance mechanism that prevents “AI got more capable, so we quietly started scoring with it.”
9. Advisory Boundary Rule
The calibration profile should explicitly state that advisory output cannot rewrite the formal score unless a future architecture intentionally changes that rule.
Recommended field:
"reasoning_policy": {
"status": "diagnostic_only_uncalibrated_initial_prior",
"score_integration": "forbidden"
}
That matters because boundary failures often begin as convenience:
- a provider looks helpful
- the advisory output seems more nuanced
- a team wants to “just incorporate it a little”
Once that happens without a versioned policy change, the formal score stops being what the architecture claims it is.
10. CLI Policy Surface
Recommended CLI behavior:
- `--policy default`
- `--policy <named_profile>`
- `--list-policies`
Not recommended:
- direct numeric overrides for weights, thresholds, caps, or detector promotion state
Developer-only experimental override support is acceptable if all of the following are true:
- it is clearly marked non-authoritative
- it writes a different `profile_status`
- output artifacts visibly say experimental policy was used
- it is excluded from default examples and documentation
11. Researcher UX and Participation Model
The most important UX constraint is this:
researchers should be able to influence policy intent without turning the CLI into a free-form scoring console.
That means STEM BIO-AI should prefer:
- named profile templates
- guided questions
- side-by-side score diffs
- explicit promotion to shared policy
over:
- raw numeric knobs
- hidden threshold editing
- untracked one-off scoring profiles
11.1 Starting Point: Profile Templates
The first interaction should not be:
"enter your own weights and caps"
It should be:
"which evaluation posture best matches your repository context?"
Current active named profiles:
- `default`
- `strict_clinical_adjacency`
Deferred named profiles:
- `reproducibility_first`
- `research_repo_baseline`
- `documentation_lenient`
- `biosecurity_cautious`
These names are easier for researchers to reason about than raw numbers.
11.2 Guided Policy Builder
After a template is selected, the next layer should be a guided builder rather than free-form editing.
Examples of acceptable questions:
- "Should code-integrity evidence outweigh README surface evidence?"
- "Should clinical-adjacent claims trigger stricter caps?"
- "Should bias/limitations require structured sections rather than term presence?"
- "Should replication evidence matter more for your workflow?"
The user answers policy questions.
The system translates them into profile deltas.
This preserves usability while keeping the policy surface inspectable.
11.2.1 Researcher Intent Scale
Before users touch any named policy, STEM BIO-AI should reduce the interpretation gap between:
- what the researcher actually cares about
- what the default profile currently emphasizes
The implemented mechanism is a researcher intent layer built around short 1–5 scales.
Important boundary:
the `1–5` scale is a UX input surface, not part of the formal score engine.
In other words:
- users do not set formal weights directly
- users do not set tier thresholds directly
- users do not generate a score by summing their answers
Instead, the scale helps the system infer which existing policy posture is closer to the researcher's intent.
Recommended scale interpretation:
- `1` = minimal emphasis
- `2` = light emphasis
- `3` = moderate emphasis
- `4` = strong emphasis
- `5` = very strong emphasis
Current question areas:
- how strict clinical-adjacent claims should be treated
- whether code-integrity evidence should outweigh README/documentation evidence
- how much reproducibility evidence should matter
- whether structured limitations should be required before partial credit is awarded
This approach borrows the usability advantage of Likert-style scales without turning the scanner into a free-form tuning instrument.
11.2.2 Why the Scale Belongs in UX, Not Scoring
Researchers can usually answer:
"clinical-adjacent claims should be treated very strictly"
more reliably than:
"set the CA cap delta to -12 and reduce the Stage 1 weight by 0.05"
That is why the scale should live in the interview layer.
The formal engine should still consume:
- named profiles
- explicit policy objects
- versioned calibration state
The scale is only a translation surface between human intent and governed policy.
11.2.3 Translation Rule
The system should map researcher answers to one of three outcomes:
- recommend an existing named profile
- show a preview-only profile delta
- show that the default profile already matches the stated posture under an explicit rule
This is safer than letting the user edit raw scoring parameters directly.
The current implementation uses an auditable rule table instead of a hidden similarity function.
Current intent variables:
- `clinical_strictness`
- `code_integrity_priority`
- `reproducibility_priority`
- `structured_limitations_requirement`
Current 1.8.0 decision rules:
| Condition | Outcome |
|---|---|
| `clinical_strictness >= 4` and `reproducibility_priority <= 3` | recommend `strict_clinical_adjacency` |
| all four values are `2` or `3` | keep `default` |
| no named profile rule matches | generate `preview_only` profile delta from explicit bounded deltas only |
This narrow table is intentional. It keeps the translation layer visible, reviewable, and testable without pretending that every strong posture already has a release-grade named profile. In particular, reproducibility_first remains deferred in 1.8.0; high reproducibility answers still fall back to preview_only Stage 4 emphasis rather than a named recommendation.
Rule priority:
- evaluate rules top-down
- stop at the first named-profile match
- if multiple strong postures are simultaneously requested and no single named profile dominates, fall back to `preview_only`
Example:
- `clinical_strictness = 4`
- `reproducibility_priority = 4`
This should fall back to preview_only in the initial implementation rather than pretending one named posture has clear priority.
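The rule table and its top-down priority can be sketched as a short, auditable function. The name `derive_outcome` and the returned labels are illustrative assumptions; the conditions follow the 1.8.0 table literally, so the first rule looks only at the two variables it names.

```python
INTENT_VARIABLES = (
    "clinical_strictness",
    "code_integrity_priority",
    "reproducibility_priority",
    "structured_limitations_requirement",
)


def derive_outcome(intent: dict) -> str:
    """Translate 1-5 intent answers into a named profile or a preview fallback."""
    for name in INTENT_VARIABLES:
        value = intent[name]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer in 1..5, got {value!r}")

    # Rules are evaluated top-down; the first named-profile match wins.
    if intent["clinical_strictness"] >= 4 and intent["reproducibility_priority"] <= 3:
        return "strict_clinical_adjacency"
    if all(intent[name] in (2, 3) for name in INTENT_VARIABLES):
        return "default"
    # No named profile rule matched: fall back to a bounded preview delta.
    return "preview_only"
```

For the example above, `clinical_strictness = 4` with `reproducibility_priority = 4` fails the first rule and the all-moderate rule, so the function returns `preview_only`.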
Lower-bound meaning:
- `1` means minimal emphasis
- `1` does not remove or disable an axis
- therefore the minimum scale value still participates in threshold checks such as `<= 2`
Current "default already matches" rule:
- if the selected baseline is `default`
- and all four intent variables are in the `2..3` range
- and no explicit named-profile rule is triggered
then the system should report that the default profile already matches the stated posture closely enough to avoid a custom preview.
Current preview_only boundary:
- do not compute nearest-profile distance
- do not infer hidden similarity scores
- do not mutate arbitrary raw numbers
Instead:
- start from the selected baseline profile
- apply explicit bounded deltas associated with the triggered answers
- mark the result as `preview_only`
This keeps the intent layer auditable during the first implementation cycle.
Current bounded deltas used in preview-only mode:
| Triggered answer | Allowed `preview_only` delta shape |
|---|---|
| `clinical_strictness >= 4` with no named-profile match | switch to stricter CA posture only; do not change unrelated weights |
| `reproducibility_priority >= 4` with no named-profile match | raise Stage 4 emphasis only within predeclared policy bounds |
| `structured_limitations_requirement >= 4` with no named-profile match | require stricter Stage 3 B2 partial-credit posture only |
| multiple strong answers with no named-profile match | combine only explicitly listed bounded deltas; do not infer new arithmetic outside documented policy fields |
These are active preview-only deltas in 1.8.0. They are not hidden similarity operations and they do not mutate the authoritative scan path.
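A sketch of how the bounded-delta table might be applied once no named profile has matched. The delta labels (`stricter_ca_posture` and so on) and the function name `preview_deltas` are hypothetical placeholders; the actual numeric bounds are not specified here and are deliberately left out.

```python
# (intent variable, hypothetical delta label), in the order of the table above.
BOUNDED_PREVIEW_DELTAS = (
    ("clinical_strictness", "stricter_ca_posture"),
    ("reproducibility_priority", "raise_stage_4_emphasis_within_bounds"),
    ("structured_limitations_requirement", "stricter_stage_3_b2_partial_credit"),
)


def preview_deltas(intent: dict) -> list:
    """List the bounded deltas triggered by strong (>= 4) answers.

    Intended to be called only after the named-profile rules found no match.
    Multiple strong answers combine listed deltas; nothing else is inferred.
    """
    return [
        label
        for variable, label in BOUNDED_PREVIEW_DELTAS
        if intent[variable] >= 4
    ]
```

Because the table has no row for `code_integrity_priority`, a strong answer on that axis alone yields an empty delta list in this sketch rather than an invented adjustment.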
11.2.4 Comparison Output
This subsection describes the immediate output of the intent-scale flow.
If the scale is used, the system should immediately show:
- chosen baseline profile
- recommended profile or preview delta
- score difference on the current repository
- tier difference on the current repository
- which policy dimensions changed
The key UX question is not:
"what settings changed?"
It is:
"what did the repository outcome change to, and why?"
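The comparison output can be illustrated with a small diff helper. Everything here is an assumption for illustration: the function name `compare_outcomes`, the use of `final_score` and `formal_tier` as result keys (borrowed from the fixture format later in this document), and the choice to report changed policy dimensions at the top-level key of the profile.

```python
def compare_outcomes(
    baseline_result: dict,
    preview_result: dict,
    baseline_profile: dict,
    preview_profile: dict,
) -> dict:
    """Summarize what changed between a baseline run and a preview run."""
    # A policy dimension counts as changed if its top-level section differs.
    changed_dimensions = sorted(
        key
        for key in set(baseline_profile) | set(preview_profile)
        if baseline_profile.get(key) != preview_profile.get(key)
    )
    return {
        "score_delta": preview_result["final_score"] - baseline_result["final_score"],
        "tier_before": baseline_result["formal_tier"],
        "tier_after": preview_result["formal_tier"],
        "tier_changed": baseline_result["formal_tier"] != preview_result["formal_tier"],
        "changed_policy_dimensions": changed_dimensions,
    }
```

Putting the outcome fields first and the changed dimensions last mirrors the UX priority stated above: the repository outcome leads, the settings explain it.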
11.2.5 Current Named Profile Definitions
The current implementation defines two named profiles:
- `default`
- `strict_clinical_adjacency`
Deferred until explicitly defined:
- `documentation_lenient`
  - not active in the `1.8.0` rule table
- `research_repo_baseline`
  - not active in the `1.8.0` rule table
- `biosecurity_cautious`
  - not active in the `1.8.0` rule table
- `reproducibility_first`
  - intentionally deferred until an actual policy diff exists and a release-grade recommendation path is defined
Documented diff fields per active profile:
- stage weights
- clinical cap / hard-floor posture
- Stage 3 B2 strictness posture
- Stage 4 emphasis posture
Current starter diff:
| Profile | Stage weights | Clinical posture | B2 posture | Stage 4 posture |
|---|---|---|---|---|
| `default` | 0.40 / 0.20 / 0.40 | standard CA cap / hard-floor rules | structured boundary language for partial credit | current baseline |
| `strict_clinical_adjacency` | 0.40 / 0.20 / 0.40 | tighter `ca_no_disclaimer_cap=60`, tighter `t0_hard_floor_cap=35` | same as `default` | baseline |
Any additional named profile must document its concrete diff here before it becomes eligible for CLI recommendation.
11.2.6 Next UX Step: simulate --profile-file
The next reasonable researcher-facing extension is a simulation-only profile file input.
Recommended shape:
stem policy simulate <repo> --profile-file my_profile.json
This should be allowed because it improves domain experimentation without weakening the authoritative scan contract.
Intended behavior:
- load a local profile file through the same schema and runtime validation path
- treat the file as simulation-only input
- do not register the file as an installed named profile automatically
- do not let `scan --policy` or `gate --policy` consume it on the authoritative path
Required guardrails:
- schema-valid before simulation starts
- `profile_read_mode` must remain `mirror_only` or `preview_only` for external profile-file simulation
- artifact output must clearly mark the file as local simulation input, not a packaged release profile
- profile hash should still be surfaced so simulation outputs remain comparable
Recommended artifact labels for this future path:
- `profile_name = external_profile_file`
- `profile_status = preview_only`
- `profile_source = local_file`
- `policy_sha256 = <computed canonical hash>`
The goal is to let researchers try domain-specific posture proposals without creating a backdoor around governed named-profile promotion.
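The guardrails and artifact labels for this future path could be enforced by a small gate in front of the simulator. This is a sketch only: `label_external_profile` is a hypothetical helper, and schema validation and hash computation are assumed to have happened before it is called.

```python
ALLOWED_EXTERNAL_READ_MODES = {"mirror_only", "preview_only"}


def label_external_profile(profile: dict, policy_sha256: str) -> dict:
    """Refuse authoritative read modes and return simulation-only artifact labels."""
    read_mode = profile.get("profile_read_mode")
    if read_mode not in ALLOWED_EXTERNAL_READ_MODES:
        raise ValueError(
            f"external profile files must be mirror_only or preview_only, got {read_mode!r}"
        )
    # Labels are fixed regardless of what the local file claims about itself,
    # so a local file can never present itself as a packaged release profile.
    return {
        "profile_name": "external_profile_file",
        "profile_status": "preview_only",
        "profile_source": "local_file",
        "policy_sha256": policy_sha256,
    }
```

Overwriting `profile_name` and `profile_status` instead of trusting the file's own values is the design choice that closes the backdoor: a local file declaring `authoritative_release` still surfaces as `preview_only` in artifacts.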
11.3 Side-by-Side Simulation
This subsection describes the more general simulation surface, which may also be used outside the intent-scale interview.
The most important feedback loop is not the profile editor itself.
It is the comparison view.
Researchers should be able to see:
- default profile result
- custom profile result
- score delta
- tier delta
- cap / hard-floor delta
- which evidence lanes changed the outcome
The right question is not:
"what numbers did I change?"
It is:
"what review outcome changed, and why?"
11.4 Roles
The intended governance model is not "every researcher defines the official score alone."
Recommended roles:
- `researcher`
  - explains domain priorities
  - surfaces false positives / false negatives
  - evaluates whether the default policy fits the repository context
- `policy steward`
  - maintains release-grade named profiles
  - reviews score-affecting changes
  - prevents silent drift between personal and team policy
- `tool`
  - computes policy diff
  - computes result diff
  - records profile metadata in artifacts
This division keeps domain experts involved without sacrificing reproducibility.
11.4.1 Researcher Participation Rules
The participation model should stay simple:
- researchers may propose posture changes
- researchers may run `derive` and `simulate`
- researchers may edit personal or branch-local preview profiles for comparison work
- researchers should not directly redefine the release-grade default policy on the authoritative score path
The key distinction is:
- `preview_only` and `experimental` are valid spaces for domain input
- `authoritative_release` is a governed release artifact
This means a researcher can legitimately say:
"for this domain, I want stricter clinical-adjacent treatment and stronger reproducibility emphasis"
but should not unilaterally convert that statement into:
- new official tier boundaries
- new default caps
- new score-bearing detector promotion
- new release-grade policy semantics
The system should therefore optimize for:
- easy posture expression
- easy repository-specific simulation
- hard-to-mutate official policy
11.4.2 Operating Principles
Operationally, the collaboration rule is:
- the researcher expresses domain priorities
- the tool translates those priorities into a visible named-profile recommendation or `preview_only` delta
- the policy steward decides whether that posture remains local, becomes experimental, or is promoted into a release-grade policy artifact
In practice:
- a researcher should be able to explore calibration without editing scanner code
- a steward should be able to reject score-affecting drift even when the local preview is reasonable
- artifacts should clearly distinguish personal preview from official release policy
The intended output is not "personalized truth."
The intended output is:
- a stable official score policy
- a transparent preview lane for domain-specific posture testing
- an explicit governance path between the two
11.4.3 Responsibility Matrix
| Action | Researcher | Policy steward | Tool |
|---|---|---|---|
| express domain posture | primary | optional review | guided input surface |
| run `stem policy derive` | primary | optional review | translates intent |
| run `stem policy simulate` | primary | optional review | computes baseline vs preview |
| edit `preview_only` deltas for local exploration | allowed | review optional | validates bounded deltas |
| create or modify `experimental` named profiles | propose | approve / reject | validates profile schema and metadata |
| promote a profile to `benchmark_candidate` | propose evidence | required approval | records status transition |
| promote a profile to `authoritative_release` | provide domain rationale | required approval | requires parity / benchmark metadata |
| change default release policy semantics | no unilateral authority | required owner | records artifact provenance |
| change score-bearing detector status | no unilateral authority | required owner | enforces transition metadata |
11.5 Promotion Path
Recommended progression:
- personal preview profile
- side-by-side run against real repository output
- review and comparison against default profile
- promotion to named team policy if approved
That promotion should update:
- `profile_name`
- `profile_status`
- `policy_sha256`
- changelog / rationale references when score logic changes
11.5.1 Promotion Gates
The progression above should not be symbolic only.
Each transition should have an explicit gate:
| From | To | Minimum gate |
|---|---|---|
| `preview_only` | `experimental` | profile file exists, schema-valid, bounded diff documented, repository-side simulation reviewed |
| `experimental` | `benchmark_candidate` | compared against default on named fixtures or benchmark repos, intended score deltas explained, no hidden arithmetic |
| `benchmark_candidate` | `authoritative_release` | parity or benchmark note completed, rationale updated, changelog entry prepared, steward approval recorded |
| `authoritative_release` | `deprecated` | replacement or retirement note recorded, artifact comparability preserved |
The intent is to prevent one common failure mode:
a locally useful domain tweak quietly becoming the new official score policy without an explicit release decision
11.5.2 What Researchers Can Change Directly
Researchers should be allowed to directly change:
- intent-scale answers
- selected baseline profile for simulation
- branch-local `preview_only` deltas inside documented bounds
- explanatory notes attached to why a preview better matches the domain
Researchers should not directly change, on the authoritative path:
- default release profile semantics
- tier boundaries for the official score
- detector graduation state
- score-bearing penalty activation rules
- release-grade policy status labels
If a domain team wants one of those changes, the correct path is:
- simulate locally
- capture the proposed diff
- compare against default on real repositories
- propose promotion through the governed profile path
11.6 Interface Direction
Recommended near-term CLI additions:
- `stem policy list`
- `stem policy explain <name>`
- `stem scan <repo> --policy <name>`
Recommended later UX additions:
- wizard-style policy derivation
- side-by-side simulation view
- profile diff explanation panel
- `simulate --profile-file <path>` for governed local experimentation without named-profile promotion
The guiding rule is:
researchers should tune posture through explicit choices, not hidden arithmetic.
12. Implemented Milestones and Remaining Roadmap
1.6.5: mirror-only profile contract
- create the profile file
- keep scanner behavior unchanged
- expose profile metadata in JSON output
- verify the profile matches current release behavior exactly using a differential fixture set with gold outputs
Recommended fixture format:
{
  "fixture_name": "default_profile_parity_small_repo",
  "target_repo": "tests/fixtures/repos/small_repo_a",
  "expected": {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": null
  }
}
Recommended fixture location:
tests/fixtures/calibration_profiles/
Phase 1 parity target fields:
- `raw_score_before_floor`
- `final_score`
- `formal_tier`
- `classification.score_cap`
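A parity check over the Phase 1 target fields might look like the sketch below. The helper name `parity_mismatches` is hypothetical, and it assumes the scan result has already been flattened so that `classification.score_cap` is available under the fixture's `score_cap` key.

```python
# Field names as they appear in the fixture's "expected" object.
PARITY_FIELDS = ("raw_score_before_floor", "final_score", "formal_tier", "score_cap")


def parity_mismatches(expected: dict, actual: dict) -> dict:
    """Return {field: (expected, actual)} for every parity field that differs."""
    return {
        field: (expected.get(field), actual.get(field))
        for field in PARITY_FIELDS
        if expected.get(field) != actual.get(field)
    }
```

An empty result means the profile mirrors current release behavior on that fixture; any non-empty result names the exact field that drifted, which keeps differential failures easy to review.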
1.6.6: policy visibility
- added `stem policy list` and `stem policy explain`
- added `--policy <name>` on scan/gate/advisory workflows
- surfaced selected profile metadata in stdout, Markdown, explain text, and PDF headers
1.6.7: derive/simulate preview
- added `stem policy derive` for auditable intent translation, now standardized on the governed `1–5` posture scale
- added `stem policy simulate <repo>` for baseline-vs-preview outcome comparison
- kept derive/simulate outputs mirror-only so authoritative scoring remains unchanged
1.6.8: preview hardening and citation readiness
- kept mirror-only scan scoring unchanged while hardening preview simulation against future profile drift
- aligned `simulate` with profile-aware C1 penalty behavior instead of assuming scanner constants forever
- revalidated `preview_only` profiles after bounded deltas are applied
- strengthened mirror-only wording across CLI and report surfaces so `scan --policy` is not confused with `policy simulate`
- added `CITATION.cff` and `.zenodo.json` so release artifacts are ready for DOI-backed citation once GitHub releases are archived by Zenodo
Remaining roadmap
- authoritative read-through of policy weights/caps/thresholds
- additional release-grade named profiles beyond `strict_clinical_adjacency`
- explicit read-through for currently declared mirror-only fields such as `stage_3_policy.b2_partial_credit_mode`
- `ca-taxonomy-vN` governance policy so runtime trigger-set changes are versioned as first-class release events
- Phase 2 target release remains intentionally unset until parity fixtures, differential tests, and rollback notes are ready for the first score-authoritative read-through patch
- future score-affecting policy changes require:
  - profile update
  - rationale update
  - changelog entry
  - benchmark note when relevant
This keeps calibration governance ahead of personalization while avoiding a risky “big bang” rewrite.
13. Non-Goals
This architecture does not aim to:
- let users personalize trust scores
- introduce hidden model-based calibration
- turn regulatory mapping into a numerical score multiplier
- make advisory output part of the formal score
- replace benchmark or manual review with profile editing
It only aims to make calibration easier to maintain without weakening lane boundaries.
15. Draft: reproducibility_first
reproducibility_first is still deferred as an active named recommendation, but a draft posture is reasonable now.
Draft intent:
- researchers want reproducibility evidence to matter more in review posture
- they do not necessarily want stricter clinical caps
- they are usually asking for stronger replication scrutiny, not a different claim-risk philosophy
Draft profile shape:
| Profile | Stage weights | Clinical posture | B2 posture | Stage 4 posture |
|---|---|---|---|---|
| `reproducibility_first` (draft) | 0.40 / 0.20 / 0.40 | same as `default` | same as `default` | `stronger_than_baseline` |
Recommended initial metadata:
- `profile_status = experimental`
- `profile_read_mode = mirror_only`
- `policy_version = ca-policy-1.0-repro-first`
Important limitation:
In the current engine, Stage 4 remains a separate replication lane and does not alter score.final_score.
That means reproducibility_first is currently useful for:
- simulation posture comparison
- replication-lane emphasis
- future promotion groundwork
but not yet for:
- score-authoritative final-score change on the formal scan path
So the draft should remain deferred until one of the following is true:
- there is a release-grade rationale for how stronger replication posture should affect official review outcomes
- the simulation/report surface exposes a meaningful replication-posture difference without pretending it changed the formal score
- a future Phase 2 read-through explicitly defines what parts of Stage 4 posture become authoritative and under what governance rule
Until then, high reproducibility intent should continue to:
- fall back to `preview_only`
- raise Stage 4 emphasis in simulation
- avoid pretending that an official release-grade named profile already exists
14. Final Position
STEM BIO-AI now has a real calibration architecture.
The correct mechanism is:
a versioned calibration profile with explicit promotion rules
not:
ad hoc runtime tuning knobs
If the system is serious about preserving the distinction between:
- formal score
- diagnostics
- regulatory traceability
- advisory
then calibration must be governed with the same discipline.
The right outcome is not “more adjustable.”
The right outcome is:
more maintainable, without becoming easier to drift.