# STEM BIO-AI Calibration Profile Architecture

Version: 1.8.0
Status: implemented mirror-only calibration contract with derive/simulate preview surfaces; 1.8.0 preview hardening complete; authoritative read-through remains future work

---

## 1. Current State

STEM BIO-AI already separates formal scoring, deterministic diagnostics, regulatory traceability, and AI advisory into distinct lanes.

As of `1.8.0`, the repository ships a real calibration architecture:

- packaged profiles in `policy/`
- schema and runtime validation
- result metadata surfacing for active profile identity
- CLI policy visibility via `stem policy list`, `stem policy explain`, and `--policy <name>`
- researcher-intent preview surfaces via `stem policy derive` and `stem policy simulate`

What is still not fully separated is the **authoritative score read-through surface**:

- stage weights
- tier boundaries
- clinical caps and hard floors
- evidence-only versus score-authoritative detector status
- reasoning-model status labels

In `1.8.0`, most score-affecting values are still implemented as runtime constants plus prose in `SCORING_RATIONALE.md`, even though mirror-only profile metadata, CLI-visible profile selection, and derive/simulate preview surfaces are already live. That is acceptable for the current release line, but it still creates a long-term maintenance risk:

> if calibration values are easy to change but hard to govern, the architecture will drift even if the lane boundaries remain conceptually correct.

This document describes the **implemented versioned calibration profile architecture** and the remaining governed path to authoritative read-through.

---

## 2. Problem Statement

As advisory systems become stronger, teams usually feel pressure to:

- raise or lower tier thresholds
- relax or tighten clinical caps
- promote diagnostics from evidence-only into score-bearing logic
- add new penalties or soften old ones
- reinterpret reasoning or advisory outputs as scoring evidence

If those changes happen ad hoc in code, three problems appear:

1. the formal score becomes harder to reproduce across versions
2. policy drift hides inside implementation edits
3. advisory or diagnostic signals may slowly leak into the formal score without an explicit governance decision

The issue is not whether calibration should ever change.

The issue is whether calibration changes are:

- versioned
- reviewable
- explainable in artifacts
- bounded by explicit promotion rules

---

## 3. Design Goal

The goal is **not** to let users arbitrarily tune the score from the CLI.

The goal is to make calibration:

- easy to inspect
- easy to version
- easy to compare between releases
- hard to mutate accidentally

In short:

> STEM BIO-AI needs easy maintenance, not easy drift.

---

## 4. Core Principle

Calibration should become a **policy object**, not a scattered implementation detail.

That means:

- the active scoring profile is represented in a versioned file
- result artifacts record which profile was used
- policy changes become visible release events
- score-affecting changes require explicit promotion criteria

This preserves the current architectural discipline:

- formal score remains deterministic
- diagnostics can stay evidence-only until promoted
- advisory remains structurally subordinate to the score
- regulatory mapping remains traceability support, not a score multiplier

---

## 5. Implemented Shape

Current packaged profile files:

`policy/scoring_profile.default.v1.json`
`policy/scoring_profile.strict_clinical_adjacency.v1.json`

Deferred profiles:

- `reproducibility_first`
- `documentation_lenient`
- `research_repo_baseline`
- `biosecurity_cautious`

Important restriction:

- normal users should select from named profiles
- normal users should **not** pass arbitrary weights or tier cutoffs on the command line

Good:

```bash
stem scan <repo> --policy default
stem scan <repo> --policy strict_clinical_adjacency
```

Bad:

```bash
stem scan <repo> --stage1-weight 0.35 --t3-threshold 68 --cap 72
```

The first preserves governance.  
The second turns the tool into an untracked tuning console.

---

## 6. Current Profile Contract

Current shipped fields:

```json
{
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "tool_version_introduced": "1.6.5",
  "tool_version_last_validated": "1.8.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "weights": {
    "stage_1_percent": 40,
    "stage_2r_percent": 20,
    "stage_3_percent": 40
  },
  "stage_baselines": {
    "stage_1": 60,
    "stage_2r": 60,
    "stage_3": 0
  },
  "tier_policy": {
    "tier_names": ["T0", "T1", "T2", "T3", "T4"],
    "tier_boundaries": [40, 55, 70, 85],
    "boundary_semantics": "left_closed_right_open",
    "score_domain": "integer_0_to_100"
  },
  "clinical_policy": {
    "ca_no_disclaimer_cap": 69,
    "t0_hard_floor_cap": 39
  },
  "code_integrity_policy": {
    "C1_penalty": 10,
    "C2_score_affecting": false,
    "C3_score_affecting": false,
    "C4_score_affecting": false
  },
  "stage_3_policy": {
    "normalization": {
      "kind": "linear_round",
      "raw_max": 80,
      "target_max": 100,
      "rounding": "half_up_int"
    }
  },
  "diagnostic_policy": {
    "BIO_smiles_surface_integrity": "evidence_only",
    "BIO_smiles_rdkit_validation": "evidence_only",
    "BIO_smiles_parser_guard": "evidence_only",
    "BIO_silent_mock_fallback": "evidence_only",
    "BIO_traceability_manifest_surface": "evidence_only",
    "BIO_subprocess_run_trace": "evidence_only"
  },
  "reasoning_policy": {
    "status": "diagnostic_only_uncalibrated_initial_prior",
    "score_integration": "forbidden"
  },
  "governance_sources": {
    "ca_taxonomy_version": "ca-taxonomy-v1",
    "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
  }
}
```

This is the active shipped schema family in `1.8.0`.

Schema notes:

- weights should be stored as integer percentages, not floating-point fractions
- tier boundaries should be stored once as a single ordered array
- normalization should be represented as named semantics plus parameters, not a free-form expression string
- `policy_version` should be independent from the tool release version
- `profile_read_mode` must distinguish mirror-only exposure from authoritative runtime loading
- `stage_3_policy.b2_partial_credit_mode` is currently a declared mirror-only profile field; authoritative Stage 3 B2 scoring in `1.8.0` still follows the hardcoded scanner path and does not yet read this value directly
- `governance_sources.ca_taxonomy_version` must increment whenever runtime CA trigger membership, severity mapping, or cap-relevant phrase semantics change
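
Two of the fields above have a direct arithmetic reading, sketched below in Python. This is an illustration under stated assumptions, not the scanner's actual code: the function names are hypothetical, `left_closed_right_open` is read as "each boundary value begins the next tier", and `half_up_int` is read as "round halves upward to an integer".

```python
from bisect import bisect_right

# Values copied from the profile example above.
TIER_NAMES = ["T0", "T1", "T2", "T3", "T4"]
TIER_BOUNDARIES = [40, 55, 70, 85]


def tier_for_score(score: int) -> str:
    """Map an integer score in 0..100 to a tier name (hypothetical helper).

    Assumes left-closed, right-open intervals: 40 belongs to T1, 39 to T0.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score outside integer_0_to_100 domain: {score}")
    return TIER_NAMES[bisect_right(TIER_BOUNDARIES, score)]


def normalize_stage_3(raw: int, raw_max: int = 80, target_max: int = 100) -> int:
    """Linear rescale with half-up rounding (hypothetical helper).

    Uses integer arithmetic only, so no floating-point rounding is involved.
    """
    if not 0 <= raw <= raw_max:
        raise ValueError(f"raw Stage 3 value out of range: {raw}")
    return (2 * raw * target_max + raw_max) // (2 * raw_max)
```

Under this reading, the shipped caps line up with the tier edges: a score capped at `39` stays in `T0`, and a score capped at `69` stays in `T2`.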

Current `profile_status` state set:

- `preview_only`
- `experimental`
- `benchmark_candidate`
- `authoritative_release`
- `deprecated`

Current status transition path:

`preview_only -> experimental -> benchmark_candidate -> authoritative_release -> deprecated`

Other transitions should require an explicit migration note.
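
The transition path can be checked mechanically. The sketch below is one possible reading, assuming that a transition off the linear path is accepted only when a non-empty migration note accompanies it; the function name and the `migration_note` parameter are hypothetical.

```python
# Ordered status path as documented above.
STATUS_PATH = [
    "preview_only",
    "experimental",
    "benchmark_candidate",
    "authoritative_release",
    "deprecated",
]


def status_transition_ok(current: str, target: str, migration_note: str = "") -> bool:
    """Return True if the profile_status change is acceptable (hypothetical helper)."""
    if current not in STATUS_PATH or target not in STATUS_PATH:
        raise ValueError(f"unknown profile_status: {current!r} -> {target!r}")
    # The next step along the documented path needs no extra justification.
    if STATUS_PATH.index(target) == STATUS_PATH.index(current) + 1:
        return True
    # Anything else (skips, rollbacks) needs an explicit migration note.
    return bool(migration_note.strip())
```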

---

## 7. Artifact Requirements

Every result object should record:

- `policy_schema_version`
- `policy_version`
- `profile_name`
- `profile_status`
- `profile_read_mode`
- `policy_sha256`

Why:

- two runs are not meaningfully comparable unless they share the same active profile
- policy drift should be visible in the artifact itself
- benchmark comparisons should be able to say whether differences came from repository evidence or policy revision
- mirror-only and authoritative-read runs must not look equivalent in artifacts

`policy_sha256` must be defined precisely.

Recommended definition:

- canonicalize the policy JSON using sorted keys and UTF-8 encoding
- exclude the `policy_sha256` field itself from the hash input
- hash the canonicalized policy file bytes only

In `mirror_only` mode, the profile file may leave `policy_sha256` as `null`.

The runtime artifact should still surface the computed canonical hash so profile comparisons remain stable during Phase 1.
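
A minimal sketch of the recommended hash definition, using only the Python standard library. The function name is hypothetical, and two details go beyond what the definition above specifies: compact separators and `ensure_ascii=False` are assumptions chosen so the serialized bytes are stable.

```python
import hashlib
import json


def canonical_policy_sha256(policy: dict) -> str:
    """Hash a parsed policy object in canonical form (hypothetical helper)."""
    # Exclude the hash field itself from the hash input.
    hashable = {key: value for key, value in policy.items() if key != "policy_sha256"}
    # Sorted keys give a stable ordering; separators and ensure_ascii are assumptions.
    canonical = json.dumps(
        hashable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the `policy_sha256` field is dropped before hashing, a profile that stores `null` and a profile that stores a previously computed value produce the same digest.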

This hash does **not** claim to represent every runtime governance source.

Instead, runtime governance dependencies such as the CA taxonomy should be surfaced separately under `governance_sources`.

Recommended JSON example:

```json
"calibration_profile": {
  "policy_schema_version": "1",
  "policy_version": "ca-policy-1.0",
  "profile_name": "default",
  "profile_status": "authoritative_release",
  "profile_read_mode": "mirror_only",
  "policy_sha256": "..."
},
"governance_sources": {
  "ca_taxonomy_version": "ca-taxonomy-v1",
  "ca_taxonomy_source": "runtime_regex_hardcoded_in_scanner_py"
}
```

---

## 8. Diagnostics Graduation Policy

The hardest maintenance problem is not weight tuning.

It is detector promotion:

> when does an evidence-only detector become score-authoritative?

Recommended detector states:

- `evidence_only`
- `candidate_scored`
- `scored`
- `deprecated`

Recommended transition rules:

| From | To | Allowed? | Notes |
|---|---|---|---|
| `evidence_only` | `candidate_scored` | yes | requires promotion gate below |
| `candidate_scored` | `scored` | yes | requires promotion gate below |
| `candidate_scored` | `evidence_only` | yes | allowed if benchmark review regresses confidence |
| `scored` | `candidate_scored` | yes | allowed for rollback after release observation |
| `scored` | `deprecated` | yes | allowed when detector is retired |
| `evidence_only` | `deprecated` | yes | allowed when detector is abandoned |
| `deprecated` | any active state | no by default | require explicit redesign note |
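
The transition table above can be encoded as a small lookup. This sketch is illustrative: the function name and the `redesign_note` parameter are hypothetical, and any pair the table does not list (for example `candidate_scored` to `deprecated`) is treated as not allowed.

```python
DETECTOR_STATES = {"evidence_only", "candidate_scored", "scored", "deprecated"}

# One entry per "yes" row in the table above.
ALLOWED_TRANSITIONS = {
    ("evidence_only", "candidate_scored"),
    ("candidate_scored", "scored"),
    ("candidate_scored", "evidence_only"),
    ("scored", "candidate_scored"),
    ("scored", "deprecated"),
    ("evidence_only", "deprecated"),
}


def detector_transition_allowed(current: str, target: str, redesign_note: str = "") -> bool:
    """Check a detector state change against the table (hypothetical helper)."""
    if current not in DETECTOR_STATES or target not in DETECTOR_STATES:
        raise ValueError(f"unknown detector state: {current!r} -> {target!r}")
    if current == "deprecated":
        # "No by default": reviving a detector needs an explicit redesign note.
        return target != "deprecated" and bool(redesign_note.strip())
    return (current, target) in ALLOWED_TRANSITIONS
```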

Recommended promotion gate before moving from `evidence_only` to `candidate_scored`:

1. commit-pinned benchmark fixtures exist for at least `N >= 20` repositories
2. detector output is reproducible across `3` consecutive identical runs
3. false-positive review has been documented with observed `false_positive_rate <= 0.05`
4. a release note explains what changed
5. `SCORING_RATIONALE.md` is updated if the detector affects score logic
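
The five conditions of this first gate can be checked together. The sketch below is hypothetical: the `PromotionEvidence` record and its field names are invented for illustration, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class PromotionEvidence:
    """Evidence gathered for one detector (hypothetical record)."""
    pinned_fixture_repos: int
    identical_consecutive_runs: int
    false_positive_rate: float
    has_release_note: bool
    affects_score_logic: bool
    rationale_updated: bool


def unmet_gate_conditions(evidence: PromotionEvidence) -> list[str]:
    """Return the gate conditions that are not yet satisfied."""
    unmet = []
    if evidence.pinned_fixture_repos < 20:
        unmet.append("fewer than 20 commit-pinned benchmark repositories")
    if evidence.identical_consecutive_runs < 3:
        unmet.append("output not reproduced across 3 consecutive identical runs")
    if evidence.false_positive_rate > 0.05:
        unmet.append("false_positive_rate above 0.05")
    if not evidence.has_release_note:
        unmet.append("missing release note")
    if evidence.affects_score_logic and not evidence.rationale_updated:
        unmet.append("SCORING_RATIONALE.md not updated")
    return unmet
```

Returning the list of unmet conditions, rather than a bare boolean, keeps the reason for a rejected promotion visible in review.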

Recommended promotion gate before moving from `candidate_scored` to `scored`:

1. benchmark evidence shows the detector improves review precision on the maintained fixture set
2. at least one release cycle of observation has occurred
3. the profile change is versioned as a policy revision

This is the governance mechanism that prevents “AI got more capable, so we quietly started scoring with it.”

---

## 9. Advisory Boundary Rule

The calibration profile should explicitly state that advisory output cannot rewrite the formal score unless a future architecture intentionally changes that rule.

Recommended field:

```json
"reasoning_policy": {
  "status": "diagnostic_only_uncalibrated_initial_prior",
  "score_integration": "forbidden"
}
```

That matters because boundary failures often begin as convenience:

- a provider looks helpful
- the advisory output seems more nuanced
- a team wants to “just incorporate it a little”

Once that happens without a versioned policy change, the formal score stops being what the architecture claims it is.

---

## 10. CLI Policy Surface

Recommended CLI behavior:

- `--policy default`
- `--policy <named_profile>`
- `stem policy list`

Not recommended:

- direct numeric overrides for weights, thresholds, caps, or detector promotion state

Developer-only experimental override support is acceptable if all of the following are true:

- it is clearly marked non-authoritative
- it writes a different `profile_status`
- output artifacts visibly say experimental policy was used
- it is excluded from default examples and documentation

---

## 11. Researcher UX and Participation Model

The most important UX constraint is this:

> researchers should be able to influence policy intent without turning the CLI into a free-form scoring console.

That means STEM BIO-AI should prefer:

- named profile templates
- guided questions
- side-by-side score diffs
- explicit promotion to shared policy

over:

- raw numeric knobs
- hidden threshold editing
- untracked one-off scoring profiles

### 11.1 Starting Point: Profile Templates

The first interaction should not be:

> "enter your own weights and caps"

It should be:

> "which evaluation posture best matches your repository context?"

Current active named profiles:

- `default`
- `strict_clinical_adjacency`

Deferred named profiles:

- `reproducibility_first`
- `research_repo_baseline`
- `documentation_lenient`
- `biosecurity_cautious`

These names are easier for researchers to reason about than raw numbers.

### 11.2 Guided Policy Builder

After a template is selected, the next layer should be a guided builder rather than free-form editing.

Examples of acceptable questions:

- "Should code-integrity evidence outweigh README surface evidence?"
- "Should clinical-adjacent claims trigger stricter caps?"
- "Should bias/limitations require structured sections rather than term presence?"
- "Should replication evidence matter more for your workflow?"

The user answers policy questions.

The system translates them into profile deltas.

This preserves usability while keeping the policy surface inspectable.

### 11.2.1 Researcher Intent Scale

Before users touch any named policy, STEM BIO-AI should reduce the interpretation gap between:

- what the researcher actually cares about
- what the default profile currently emphasizes

The implemented mechanism is a **researcher intent layer** built around short `1–5` scales.

Important boundary:

> the `1–5` scale is a UX input surface, not part of the formal score engine.

In other words:

- users do **not** set formal weights directly
- users do **not** set tier thresholds directly
- users do **not** generate a score by summing their answers

Instead, the scale helps the system infer which existing policy posture is closer to the researcher's intent.

Recommended scale interpretation:

- `1` = minimal emphasis
- `2` = light emphasis
- `3` = moderate emphasis
- `4` = strong emphasis
- `5` = very strong emphasis

Current question areas:

- how strict clinical-adjacent claims should be treated
- whether code-integrity evidence should outweigh README/documentation evidence
- how much reproducibility evidence should matter
- whether structured limitations should be required before partial credit is awarded

This approach borrows the usability advantage of Likert-style scales without turning the scanner into a free-form tuning instrument.

### 11.2.2 Why the Scale Belongs in UX, Not Scoring

Researchers can usually answer:

> "clinical-adjacent claims should be treated very strictly"

more reliably than:

> "set the CA cap delta to -12 and reduce the Stage 1 weight by 0.05"

That is why the scale should live in the interview layer.

The formal engine should still consume:

- named profiles
- explicit policy objects
- versioned calibration state

The scale is only a translation surface between human intent and governed policy.

### 11.2.3 Translation Rule

The system should map researcher answers to one of three outcomes:

1. recommend an existing named profile
2. show a preview-only profile delta
3. show that the default profile already matches the stated posture under an explicit rule

This is safer than letting the user edit raw scoring parameters directly.

The current implementation uses an auditable rule table instead of a hidden similarity function.

Current intent variables:

- `clinical_strictness`
- `code_integrity_priority`
- `reproducibility_priority`
- `structured_limitations_requirement`

Current `1.8.0` decision rules:

| Condition | Outcome |
|---|---|
| `clinical_strictness >= 4` and `reproducibility_priority <= 3` | recommend `strict_clinical_adjacency` |
| all four values are `2` or `3` | keep `default` |
| no named profile rule matches | generate `preview_only` profile delta from explicit bounded deltas only |

This narrow table is intentional. It keeps the translation layer visible, reviewable, and testable without pretending that every strong posture already has a release-grade named profile. In particular, `reproducibility_first` remains deferred in `1.8.0`; high reproducibility answers still fall back to `preview_only` Stage 4 emphasis rather than a named recommendation.

Rule priority:

- evaluate rules top-down
- stop at the first named-profile match
- if multiple strong postures are simultaneously requested and no single named profile dominates, fall back to `preview_only`

Example:

- `clinical_strictness = 4`
- `reproducibility_priority = 4`

This should fall back to `preview_only` in the initial implementation rather than pretending one named posture has clear priority.
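
The rule table and its priority order can be sketched as a short top-down function. This is not the shipped implementation: the `Intent` class and `derive_outcome` name are hypothetical, and the "multiple strong postures" clause is represented only through the `reproducibility_priority <= 3` condition in the first rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    """Researcher answers on the 1-5 scale (hypothetical container)."""
    clinical_strictness: int
    code_integrity_priority: int
    reproducibility_priority: int
    structured_limitations_requirement: int

    def values(self) -> tuple:
        return (
            self.clinical_strictness,
            self.code_integrity_priority,
            self.reproducibility_priority,
            self.structured_limitations_requirement,
        )


def derive_outcome(intent: Intent) -> str:
    """Evaluate the rules top-down and stop at the first match."""
    if any(not 1 <= value <= 5 for value in intent.values()):
        raise ValueError("intent answers must be in the range 1..5")
    if intent.clinical_strictness >= 4 and intent.reproducibility_priority <= 3:
        return "strict_clinical_adjacency"
    if all(value in (2, 3) for value in intent.values()):
        return "default"
    # No named-profile rule matched: fall back to a preview-only delta.
    return "preview_only"
```

With `clinical_strictness = 4` and `reproducibility_priority = 4`, the first rule fails on its second condition and the second rule fails because `4` is outside `2..3`, so the result is `preview_only`, matching the example above.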

Lower-bound meaning:

- `1` means minimal emphasis
- `1` does **not** remove or disable an axis
- therefore the minimum scale value still participates in threshold checks such as `reproducibility_priority <= 3`

Current "default already matches" rule:

- if the selected baseline is `default`
- and all four intent variables are in the `2..3` range
- and no explicit named-profile rule is triggered

then the system should report that the default profile already matches the stated posture closely enough to avoid a custom preview.

Current `preview_only` boundary:

- do **not** compute nearest-profile distance
- do **not** infer hidden similarity scores
- do **not** mutate arbitrary raw numbers

Instead:

- start from the selected baseline profile
- apply explicit bounded deltas associated with the triggered answers
- mark the result as `preview_only`

This keeps the intent layer auditable during the first implementation cycle.

Current bounded deltas used in preview-only mode:

| Triggered answer | Allowed `preview_only` delta shape |
|---|---|
| `clinical_strictness >= 4` with no named-profile match | switch to stricter CA posture only; do not change unrelated weights |
| `reproducibility_priority >= 4` with no named-profile match | raise Stage 4 emphasis only within predeclared policy bounds |
| `structured_limitations_requirement >= 4` with no named-profile match | require stricter Stage 3 B2 partial-credit posture only |
| multiple strong answers with no named-profile match | combine only explicitly listed bounded deltas; do not infer new arithmetic outside documented policy fields |

These are active preview-only deltas in `1.8.0`. They are not hidden similarity operations and they do not mutate the authoritative scan path.
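
One way to keep the delta set explicit is to collect only the listed delta shapes and nothing else. In the sketch below the function name and the string labels are hypothetical placeholders for the delta shapes in the table; no arithmetic on policy values is performed.

```python
def preview_delta_shapes(intent: dict, named_profile_matched: bool) -> list[str]:
    """List the bounded deltas triggered by the answers (hypothetical helper).

    Labels are illustrative names for the delta shapes in the table above.
    """
    if named_profile_matched:
        # A named profile was recommended, so no preview delta is generated.
        return []
    deltas = []
    if intent["clinical_strictness"] >= 4:
        deltas.append("stricter_ca_posture")
    if intent["reproducibility_priority"] >= 4:
        deltas.append("raise_stage_4_emphasis_within_bounds")
    if intent["structured_limitations_requirement"] >= 4:
        deltas.append("stricter_stage_3_b2_partial_credit")
    # Multiple strong answers simply combine the listed deltas.
    return deltas
```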

### 11.2.4 Comparison Output

This subsection describes the immediate output of the intent-scale flow.

If the scale is used, the system should immediately show:

- chosen baseline profile
- recommended profile or preview delta
- score difference on the current repository
- tier difference on the current repository
- which policy dimensions changed

The key UX question is not:

> "what settings changed?"

It is:

> "what did the repository outcome change to, and why?"

### 11.2.5 Current Named Profile Definitions

The current implementation defines two named profiles:

- `default`
- `strict_clinical_adjacency`

Deferred until explicitly defined:

- `documentation_lenient`
  - not active in the `1.8.0` rule table
- `research_repo_baseline`
  - not active in the `1.8.0` rule table
- `biosecurity_cautious`
  - not active in the `1.8.0` rule table
- `reproducibility_first`
  - intentionally deferred until an actual policy diff exists and a release-grade recommendation path is defined

Documented diff fields per active profile:

- stage weights
- clinical cap / hard-floor posture
- Stage 3 B2 strictness posture
- Stage 4 emphasis posture

Current starter diff:

| Profile | Stage weights (%) | Clinical posture | B2 posture | Stage 4 posture |
|---|---|---|---|---|
| `default` | `40 / 20 / 40` | standard CA cap / hard-floor rules (`ca_no_disclaimer_cap=69`, `t0_hard_floor_cap=39`) | structured boundary language for partial credit | baseline |
| `strict_clinical_adjacency` | `40 / 20 / 40` | tighter `ca_no_disclaimer_cap=60`, tighter `t0_hard_floor_cap=35` | same as `default` | baseline |

Any additional named profile must document its concrete diff here before it becomes eligible for CLI recommendation.

### 11.2.6 Next UX Step: `simulate --profile-file`

The next reasonable researcher-facing extension is a simulation-only profile file input.

Recommended shape:

```bash
stem policy simulate <repo> --profile-file my_profile.json
```

This should be allowed because it improves domain experimentation without weakening the authoritative scan contract.

Intended behavior:

- load a local profile file through the same schema and runtime validation path
- treat the file as simulation-only input
- do not register the file as an installed named profile automatically
- do not let `scan --policy` or `gate --policy` consume it on the authoritative path

Required guardrails:

- schema-valid before simulation starts
- `profile_read_mode` must remain `mirror_only` or `preview_only` for external profile-file simulation
- artifact output must clearly mark the file as local simulation input, not a packaged release profile
- profile hash should still be surfaced so simulation outputs remain comparable

Recommended artifact labels for this future path:

- `profile_name = external_profile_file`
- `profile_status = preview_only`
- `profile_source = local_file`
- `policy_sha256 = <computed canonical hash>`
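
The guardrails and labels for this future path could be combined as follows. This is a sketch under assumptions: the function name is hypothetical, schema validation is assumed to have happened earlier, and the hash is passed in rather than computed here.

```python
ALLOWED_EXTERNAL_READ_MODES = {"mirror_only", "preview_only"}


def label_external_profile(profile: dict, policy_sha256: str) -> dict:
    """Build artifact labels for a local simulation-only profile file."""
    read_mode = profile.get("profile_read_mode")
    if read_mode not in ALLOWED_EXTERNAL_READ_MODES:
        raise ValueError(
            f"external profile files may not use profile_read_mode={read_mode!r}"
        )
    # Labels follow the recommended list above; the file's own name and
    # status are deliberately not trusted.
    return {
        "profile_name": "external_profile_file",
        "profile_status": "preview_only",
        "profile_source": "local_file",
        "policy_sha256": policy_sha256,
    }
```

Overwriting `profile_name` and `profile_status` regardless of what the file declares means a local file cannot present itself as a packaged release profile in the artifact.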

The goal is to let researchers try domain-specific posture proposals without creating a backdoor around governed named-profile promotion.

### 11.3 Side-by-Side Simulation

This subsection describes the more general simulation surface, which may also be used outside the intent-scale interview.

The most important feedback loop is not the profile editor itself.

It is the comparison view.

Researchers should be able to see:

- default profile result
- custom profile result
- score delta
- tier delta
- cap / hard-floor delta
- which evidence lanes changed the outcome

The right question is not:

> "what numbers did I change?"

It is:

> "what review outcome changed, and why?"
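
A comparison view of this kind reduces to a small diff over two result objects. The sketch below assumes a hypothetical result shape with `score`, `tier`, and a flat `policy` mapping; the real artifact layout may differ.

```python
def compare_results(baseline: dict, preview: dict) -> dict:
    """Summarize what changed between two runs (hypothetical result shape)."""
    policy_keys = set(baseline["policy"]) | set(preview["policy"])
    changed = sorted(
        key
        for key in policy_keys
        if baseline["policy"].get(key) != preview["policy"].get(key)
    )
    return {
        "score_delta": preview["score"] - baseline["score"],
        "tier_before": baseline["tier"],
        "tier_after": preview["tier"],
        "tier_changed": baseline["tier"] != preview["tier"],
        "changed_policy_dimensions": changed,
    }


# Usage with made-up numbers, for illustration only.
baseline = {"score": 72, "tier": "T3", "policy": {"ca_no_disclaimer_cap": 69}}
preview = {"score": 60, "tier": "T2", "policy": {"ca_no_disclaimer_cap": 60}}
summary = compare_results(baseline, preview)
```

Reporting the changed policy dimensions next to the score and tier deltas lets the output answer "why" as well as "what".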

### 11.4 Roles

The intended governance model is not "every researcher defines the official score alone."

Recommended roles:

- `researcher`
  - explains domain priorities
  - surfaces false positives / false negatives
  - evaluates whether the default policy fits the repository context
- `policy steward`
  - maintains release-grade named profiles
  - reviews score-affecting changes
  - prevents silent drift between personal and team policy
- `tool`
  - computes policy diff
  - computes result diff
  - records profile metadata in artifacts

This division keeps domain experts involved without sacrificing reproducibility.

### 11.4.1 Researcher Participation Rules

The participation model should stay simple:

- researchers may propose posture changes
- researchers may run `derive` and `simulate`
- researchers may edit personal or branch-local preview profiles for comparison work
- researchers should not directly redefine the release-grade default policy on the authoritative score path

The key distinction is:

- `preview_only` and `experimental` are valid spaces for domain input
- `authoritative_release` is a governed release artifact

This means a researcher can legitimately say:

> "for this domain, I want stricter clinical-adjacent treatment and stronger reproducibility emphasis"

but should not unilaterally convert that statement into:

- new official tier boundaries
- new default caps
- new score-bearing detector promotion
- new release-grade policy semantics

The system should therefore optimize for:

- easy posture expression
- easy repository-specific simulation
- hard-to-mutate official policy

### 11.4.2 Operating Principles

Operationally, the collaboration rule is:

1. the researcher expresses domain priorities
2. the tool translates those priorities into a visible named-profile recommendation or `preview_only` delta
3. the policy steward decides whether that posture remains local, becomes experimental, or is promoted into a release-grade policy artifact

In practice:

- a researcher should be able to explore calibration without editing scanner code
- a steward should be able to reject score-affecting drift even when the local preview is reasonable
- artifacts should clearly distinguish personal preview from official release policy

The intended output is not "personalized truth."

The intended output is:

- a stable official score policy
- a transparent preview lane for domain-specific posture testing
- an explicit governance path between the two

### 11.4.3 Responsibility Matrix

| Action | Researcher | Policy steward | Tool |
|---|---|---|---|
| express domain posture | primary | optional review | guided input surface |
| run `stem policy derive` | primary | optional review | translates intent |
| run `stem policy simulate` | primary | optional review | computes baseline vs preview |
| edit `preview_only` deltas for local exploration | allowed | review optional | validates bounded deltas |
| create or modify `experimental` named profiles | propose | approve / reject | validates profile schema and metadata |
| promote a profile to `benchmark_candidate` | propose evidence | required approval | records status transition |
| promote a profile to `authoritative_release` | provide domain rationale | required approval | requires parity / benchmark metadata |
| change default release policy semantics | no unilateral authority | required owner | records artifact provenance |
| change score-bearing detector status | no unilateral authority | required owner | enforces transition metadata |

### 11.5 Promotion Path

Recommended progression:

1. personal preview profile
2. side-by-side run against real repository output
3. review and comparison against default profile
4. promotion to named team policy if approved

That promotion should update:

- `profile_name`
- `profile_status`
- `policy_sha256`
- changelog / rationale references when score logic changes
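
As an illustration of keeping those fields in step, the sketch below recomputes `policy_sha256` whenever the status changes. It assumes the hash is taken over a canonical JSON serialization of the profile with the hash field itself excluded; this document does not specify the hashing scheme, and `policy_sha256_of`, `promote`, and the example profile name are hypothetical.

```python
import hashlib
import json


def policy_sha256_of(profile: dict) -> str:
    """Hash the profile's canonical JSON form, excluding the hash field itself.

    Assumption: canonical form means sorted keys, compact separators, UTF-8.
    """
    body = {k: v for k, v in profile.items() if k != "policy_sha256"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def promote(profile: dict, new_status: str) -> dict:
    """Return a copy with the new status and a hash that matches the new content."""
    promoted = {**profile, "profile_status": new_status}
    promoted["policy_sha256"] = policy_sha256_of(promoted)
    return promoted


draft = {
    "profile_name": "example_team_policy",  # hypothetical name
    "profile_status": "preview_only",
}
promoted = promote(draft, "experimental")

# The stored hash can be re-derived from the promoted content at any time.
assert promoted["policy_sha256"] == policy_sha256_of(promoted)
```

Because the hash excludes its own field, a reviewer can verify an artifact by recomputing the digest from the profile as shipped.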

### 11.5.1 Promotion Gates

The progression above should not be merely symbolic.

Each transition should have an explicit gate:

| From | To | Minimum gate |
|---|---|---|
| `preview_only` | `experimental` | profile file exists, schema-valid, bounded diff documented, repository-side simulation reviewed |
| `experimental` | `benchmark_candidate` | compared against default on named fixtures or benchmark repos, intended score deltas explained, no hidden arithmetic |
| `benchmark_candidate` | `authoritative_release` | parity or benchmark note completed, rationale updated, changelog entry prepared, steward approval recorded |
| `authoritative_release` | `deprecated` | replacement or retirement note recorded, artifact comparability preserved |

The intent is to prevent one common failure mode:

> a locally useful domain tweak quietly becoming the new official score policy without an explicit release decision
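
The gate table above can be sketched as an allow-list of status transitions, each with a set of required evidence items. This is a minimal illustration, not the tool's implementation: the gate item names are hypothetical labels for the "Minimum gate" column, and `missing_gates` is an assumed helper name.

```python
# Gate item names are hypothetical labels for the table's "Minimum gate" column.
PROMOTION_GATES = {
    ("preview_only", "experimental"): frozenset({
        "profile_file_exists",
        "schema_valid",
        "bounded_diff_documented",
        "repository_simulation_reviewed",
    }),
    ("experimental", "benchmark_candidate"): frozenset({
        "compared_against_default",
        "score_deltas_explained",
        "no_hidden_arithmetic",
    }),
    ("benchmark_candidate", "authoritative_release"): frozenset({
        "parity_or_benchmark_note",
        "rationale_updated",
        "changelog_entry_prepared",
        "steward_approval_recorded",
    }),
    ("authoritative_release", "deprecated"): frozenset({
        "retirement_note_recorded",
        "comparability_preserved",
    }),
}


def missing_gates(current: str, target: str, evidence: set[str]) -> list[str]:
    """Return the gate items still missing for a transition, sorted by name.

    Transitions that are not rows in the table (for example skipping straight
    from preview_only to authoritative_release) are rejected outright.
    """
    key = (current, target)
    if key not in PROMOTION_GATES:
        raise ValueError(f"transition {current!r} -> {target!r} is not in the gate table")
    return sorted(PROMOTION_GATES[key] - evidence)


# A preview with only a valid profile file is not yet ready to become experimental.
print(missing_gates("preview_only", "experimental", {"profile_file_exists", "schema_valid"}))
```

Keeping the transitions in a single table makes a silent jump from local tweak to official policy a hard error instead of a review oversight.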

### 11.5.2 What Researchers Can Change Directly

Researchers should be allowed to directly change:

- intent-scale answers
- selected baseline profile for simulation
- branch-local `preview_only` deltas inside documented bounds
- explanatory notes attached to why a preview better matches the domain

Researchers should not directly change, on the authoritative path:

- default release profile semantics
- tier boundaries for the official score
- detector graduation state
- score-bearing penalty activation rules
- release-grade policy status labels

If a domain team wants one of those changes, the correct path is:

1. simulate locally
2. capture the proposed diff
3. compare against default on real repositories
4. propose promotion through the governed profile path
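
The split between directly editable and governed fields can be sketched as a small validator for `preview_only` deltas. Everything named here is an assumption made for illustration: the field identifiers, the numeric bounds, and the `validate_preview_deltas` helper do not come from the scanner.

```python
# Hypothetical identifiers mirroring the "should not directly change" list.
PROTECTED_FIELDS = frozenset({
    "default_release_semantics",
    "tier_boundaries",
    "detector_graduation_state",
    "penalty_activation_rules",
    "profile_status",
})

# Hypothetical documented bounds: field -> (min_delta, max_delta).
DELTA_BOUNDS = {
    "stage_weights.stage_1": (-0.10, 0.10),
    "stage_weights.stage_2": (-0.10, 0.10),
    "stage_weights.stage_3": (-0.10, 0.10),
}


def validate_preview_deltas(deltas: dict[str, float]) -> list[str]:
    """Return one error message per delta that a researcher may not apply directly."""
    errors = []
    for field, delta in sorted(deltas.items()):
        if field in PROTECTED_FIELDS:
            errors.append(f"{field}: requires the governed promotion path")
        elif field not in DELTA_BOUNDS:
            errors.append(f"{field}: no documented bound")
        else:
            low, high = DELTA_BOUNDS[field]
            if not low <= delta <= high:
                errors.append(f"{field}: delta {delta} outside [{low}, {high}]")
    return errors


# An in-bounds weight nudge passes; touching tier boundaries does not.
assert validate_preview_deltas({"stage_weights.stage_1": 0.05}) == []
assert validate_preview_deltas({"tier_boundaries": 1.0}) == [
    "tier_boundaries: requires the governed promotion path"
]
```

Rejecting fields with no documented bound, instead of letting them through, keeps the preview lane limited to changes that have been explicitly scoped.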

### 11.6 Interface Direction

Recommended near-term CLI additions:

- `stem policy list`
- `stem policy explain <name>`
- `stem scan <repo> --policy <name>`

Recommended later UX additions:

- wizard-style policy derivation
- side-by-side simulation view
- profile diff explanation panel
- `simulate --profile-file <path>` for governed local experimentation without named-profile promotion

The guiding rule is:

> researchers should tune posture through explicit choices, not hidden arithmetic.

---

## 12. Implemented Milestones and Remaining Roadmap

### 1.6.5: mirror-only profile contract

- created the profile file
- kept scanner behavior unchanged
- exposed profile metadata in JSON output
- verified that the profile matches current release behavior exactly, using a differential fixture set with gold outputs

Recommended fixture format:

```json
{
  "fixture_name": "default_profile_parity_small_repo",
  "target_repo": "tests/fixtures/repos/small_repo_a",
  "expected": {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": null
  }
}
```

Recommended fixture location:

- `tests/fixtures/calibration_profiles/`

Phase 1 parity target fields:

- `raw_score_before_floor`
- `final_score`
- `formal_tier`
- `classification.score_cap`
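
A parity check over such a fixture could look like the sketch below. It assumes the caller has already flattened the scanner's JSON output into the same keys the fixture uses under `expected` (for example, mapping `classification.score_cap` to `score_cap`); `parity_mismatches` is a hypothetical helper, not an existing test utility.

```python
import json

# Keys as they appear under "expected" in the fixture. Mapping them onto the
# scanner's nested JSON output is assumed to happen before this check.
PARITY_FIELDS = ("raw_score_before_floor", "final_score", "formal_tier", "score_cap")
_MISSING = "<missing>"


def parity_mismatches(fixture: dict, observed: dict) -> dict:
    """Return {field: (expected, observed)} for every parity field that differs."""
    expected = fixture["expected"]
    mismatches = {}
    for field in PARITY_FIELDS:
        want = expected.get(field, _MISSING)
        got = observed.get(field, _MISSING)
        if want != got:
            mismatches[field] = (want, got)
    return mismatches


fixture = json.loads("""
{
  "fixture_name": "default_profile_parity_small_repo",
  "target_repo": "tests/fixtures/repos/small_repo_a",
  "expected": {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": null
  }
}
""")

# Illustrative observed values, not real scanner output.
observed = {
    "raw_score_before_floor": 67,
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "score_cap": None,
}
assert parity_mismatches(fixture, observed) == {}
```

The sentinel distinguishes a field that is absent from one that is present with a `null` value, which matters for `score_cap`.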

### 1.6.6: policy visibility

- added `stem policy list` and `stem policy explain`
- added `--policy <name>` on scan/gate/advisory workflows
- surfaced selected profile metadata in stdout, Markdown, explain text, and PDF headers

### 1.6.7: derive/simulate preview

- added `stem policy derive` for auditable intent translation, now standardized on the governed `1–5` posture scale
- added `stem policy simulate <repo>` for baseline-vs-preview outcome comparison
- kept derive/simulate outputs mirror-only so authoritative scoring remains unchanged

### 1.6.8: preview hardening and citation readiness

- kept mirror-only scan scoring unchanged while hardening preview simulation against future profile drift
- aligned `simulate` with profile-aware C1 penalty behavior instead of assuming scanner constants forever
- revalidated `preview_only` profiles after bounded deltas are applied
- strengthened mirror-only wording across CLI and report surfaces so `scan --policy` is not confused with `policy simulate`
- added `CITATION.cff` and `.zenodo.json` so release artifacts are ready for DOI-backed citation once GitHub releases are archived by Zenodo

### Remaining roadmap

- authoritative read-through of policy weights/caps/thresholds
- additional release-grade named profiles beyond `strict_clinical_adjacency`
- explicit read-through for currently declared mirror-only fields such as `stage_3_policy.b2_partial_credit_mode`
- `ca-taxonomy-vN` governance policy so runtime trigger-set changes are versioned as first-class release events
- Phase 2 target release remains intentionally unset until parity fixtures, differential tests, and rollback notes are ready for the first score-authoritative read-through patch
- future score-affecting policy changes require:
  - profile update
  - rationale update
  - changelog entry
  - benchmark note when relevant

This keeps calibration governance ahead of personalization while avoiding a risky “big bang” rewrite.

---

## 13. Non-Goals

This architecture does **not** aim to:

- let users personalize trust scores
- introduce hidden model-based calibration
- turn regulatory mapping into a numerical score multiplier
- make advisory output part of the formal score
- replace benchmark or manual review with profile editing

It only aims to make calibration easier to maintain **without weakening lane boundaries**.

---

## 15. Draft: `reproducibility_first`

`reproducibility_first` is still deferred as an active named recommendation, but a draft posture is reasonable now.

Draft intent:

- researchers want reproducibility evidence to matter more in review posture
- they do not necessarily want stricter clinical caps
- they are usually asking for stronger replication scrutiny, not a different claim-risk philosophy

Draft profile shape:

| Profile | Stage weights | Clinical posture | B2 posture | Stage 4 posture |
|---|---|---|---|---|
| `reproducibility_first` (draft) | `0.40 / 0.20 / 0.40` | same as `default` | same as `default` | `stronger_than_baseline` |

Recommended initial metadata:

- `profile_status = experimental`
- `profile_read_mode = mirror_only`
- `policy_version = ca-policy-1.0-repro-first`
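
For illustration, the draft shape and metadata above could be held together in one record with a basic sanity check. This is a sketch under stated assumptions: the `DraftProfile` class is hypothetical, the weights are taken in the order given in the table, and the requirement that they sum to 1.0 is an assumption inferred from the draft values, not a documented rule.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftProfile:
    """Hypothetical container for a draft profile's metadata and stage weights."""

    profile_name: str
    profile_status: str
    profile_read_mode: str
    policy_version: str
    stage_weights: tuple  # weights in the order given in the draft table

    def __post_init__(self):
        # Assumption: stage weights are fractions that should sum to 1.0.
        total = sum(self.stage_weights)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"stage weights sum to {total}, expected 1.0")

    @property
    def is_mirror_only(self) -> bool:
        """True when the profile is declared mirror-only."""
        return self.profile_read_mode == "mirror_only"


repro_first = DraftProfile(
    profile_name="reproducibility_first",
    profile_status="experimental",
    profile_read_mode="mirror_only",
    policy_version="ca-policy-1.0-repro-first",
    stage_weights=(0.40, 0.20, 0.40),
)
assert repro_first.is_mirror_only
```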

Important limitation:

In the current engine, Stage 4 remains a separate replication lane and does not alter `score.final_score`.

That means `reproducibility_first` is currently useful for:

- simulation posture comparison
- replication-lane emphasis
- future promotion groundwork

but not yet for:

- score-authoritative final-score change on the formal scan path

So the draft should remain deferred until at least one of the following is true:

1. there is a release-grade rationale for how stronger replication posture should affect official review outcomes
2. the simulation/report surface exposes a meaningful replication-posture difference without pretending it changed the formal score
3. a future Phase 2 read-through explicitly defines what parts of Stage 4 posture become authoritative and under what governance rule

Until then, high reproducibility intent should continue to:

- fall back to `preview_only`
- raise Stage 4 emphasis in simulation
- avoid pretending that an official release-grade named profile already exists

---

## 14. Final Position

STEM BIO-AI now has a real calibration architecture.

The correct mechanism is:

> a versioned calibration profile with explicit promotion rules

not:

> ad hoc runtime tuning knobs

If the system is serious about preserving the distinction between:

- formal score
- diagnostics
- regulatory traceability
- advisory

then calibration must be governed with the same discipline.

The right outcome is not “more adjustable.”

The right outcome is:

**more maintainable, without becoming easier to drift.**