Codex Claude Opus 4.7 committed on
Commit 175e882 · 1 Parent(s): 226dec6

Add per-gene / per-variant-type validation breakdown script

scripts/per_gene_breakdown.py slices any validation report (P/VUS/B
adjacent-tier) by gene, variant type (missense/splice/indel/synonymous
inferred from HGVS), and review status. This is the stratification
analysis a lab director / reviewer demands after seeing a headline
concordance number.

Output for the deterministic 87.4% run, saved to
docs/per_gene_breakdown_1000.json:

Per-variant-type: missense (bucket missense_or_silent) is the weakest at
83.1% (547 of 658 variants); every other bucket is 93-100%. Missense
accounts for 111 of the 125 discordant calls, i.e. almost the entire
headline drop; it's also where literature criteria (PS3, PP1, PM3)
matter most.
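
The share of the headline drop attributable to each bucket can be checked directly from the saved counts. The sketch below copies the (correct, n) pairs from per_variant_type in docs/per_gene_breakdown_1000.json; the helper names `miss_share` and `concordance_without` are hypothetical and not part of the script.

```python
from __future__ import annotations

# (correct, n) per bucket, copied from per_variant_type in
# docs/per_gene_breakdown_1000.json.
PER_TYPE: dict[str, tuple[int, int]] = {
    "missense_or_silent": (547, 658),
    "splice_region": (177, 182),
    "inframe_del": (64, 69),
    "other": (48, 51),
    "inframe_ins": (30, 31),
    "synonymous": (2, 2),
}


def miss_share(per_type: dict[str, tuple[int, int]], key: str) -> float:
    """Fraction of all discordant calls that fall in one bucket."""
    misses = {k: n - correct for k, (correct, n) in per_type.items()}
    return misses[key] / sum(misses.values())


def concordance_without(per_type: dict[str, tuple[int, int]], key: str) -> float:
    """Pooled concordance over every bucket except `key`."""
    correct = sum(c for k, (c, _) in per_type.items() if k != key)
    total = sum(n for k, (_, n) in per_type.items() if k != key)
    return correct / total


total_correct = sum(c for c, _ in PER_TYPE.values())
total_n = sum(n for _, n in PER_TYPE.values())

print(f"overall: {total_correct}/{total_n} = {total_correct / total_n:.1%}")
print(f"missense share of misses: {miss_share(PER_TYPE, 'missense_or_silent'):.1%}")
print(f"all other buckets pooled: {concordance_without(PER_TYPE, 'missense_or_silent'):.1%}")
```

With these counts the overall figure is 868/993 (87.4%), the missense bucket holds 88.8% of the misses, and the remaining buckets pool to 95.8%.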

Worst-performing genes (n ≥ 3 unless noted): ZBTB20 0%, COL1A1 0%,
GRIN2B 0% (n = 2), MYH7 33%. Inspection reveals the common pattern:
PM2_supporting + PP5_strong totals +5 Bayesian points, just below the LP
threshold of +6. This points to a systematic miscalibration of PM2
strength: Richards 2015 specified MODERATE, and the codebase sets the
constant to moderate but hardcodes "supporting" in the score_population
path. The fix is pending until the RAG validation completes, to avoid
invalidating the in-progress run.
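
The point arithmetic behind that pattern can be sketched as follows. This is an illustration only. The point values (supporting 1, moderate 2, strong 4, very strong 8) and the Pathogenic cutoff of 10 are assumed from the published Tavtigian et al. 2020 point scale, which is consistent with the +5 and +6 figures quoted above. `POINTS`, `total_points`, and `tier` are hypothetical names, not the codebase's, and only pathogenic-side evidence is handled.

```python
from __future__ import annotations

# ASSUMPTION: point values follow the Tavtigian et al. 2020 scale.
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}
LP_THRESHOLD = 6   # quoted in the commit message
P_THRESHOLD = 10   # ASSUMPTION: taken from the same published scale


def total_points(criteria: dict[str, str]) -> int:
    """Sum points for a mapping of criterion name -> applied strength."""
    return sum(POINTS[strength] for strength in criteria.values())


def tier(points: int) -> str:
    """Map a non-negative point total to a tier (benign side not modelled)."""
    if points < 0:
        raise ValueError("benign-side evidence is outside this sketch")
    if points >= P_THRESHOLD:
        return "Pathogenic"
    if points >= LP_THRESHOLD:
        return "Likely Pathogenic"
    return "Uncertain Significance"


# PM2 as hardcoded in the scoring path vs. PM2 at the constant's strength.
hardcoded = {"PM2": "supporting", "PP5": "strong"}
as_constant = {"PM2": "moderate", "PP5": "strong"}

for label, criteria in (("hardcoded", hardcoded), ("as_constant", as_constant)):
    pts = total_points(criteria)
    print(f"{label}: +{pts} -> {tier(pts)}")
```

Under these assumptions the hardcoded path gives +5 (Uncertain Significance), while PM2 at moderate gives +6 (Likely Pathogenic). A variant expected to be Pathogenic or Likely Pathogenic therefore moves from the VUS class to the P class, which is the class-level match the breakdown script counts as correct.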

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

docs/per_gene_breakdown_1000.json ADDED
@@ -0,0 +1,1137 @@
+{
+  "source_report": "docs/clinical_validation_results_1000.json",
+  "headline_concordance": 0.8741188318227593,
+  "total_scored": 993,
+  "per_gene": [
+    {
+      "key": "ATP6V1B1",
+      "n": 4,
+      "correct": 4,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 3,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ABCC6",
+      "n": 3,
+      "correct": 2,
+      "concordance": 0.6666666666666666,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "TCF12",
+      "n": 3,
+      "correct": 2,
+      "concordance": 0.6666666666666666,
+      "tier_distribution": {
+        "Pathogenic": 2,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "MYH7",
+      "n": 3,
+      "correct": 1,
+      "concordance": 0.3333333333333333,
+      "tier_distribution": {
+        "Pathogenic": 2,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "ZBTB20",
+      "n": 3,
+      "correct": 0,
+      "concordance": 0.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "BCKDHB",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "COL1A1",
+      "n": 3,
+      "correct": 0,
+      "concordance": 0.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "PHF6",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "PALB2",
+      "n": 3,
+      "correct": 2,
+      "concordance": 0.6666666666666666,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "AMN",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "HOGA1",
+      "n": 3,
+      "correct": 2,
+      "concordance": 0.6666666666666666,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "LAMP2",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "HNF1A",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ERCC4",
+      "n": 3,
+      "correct": 3,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "SCN5A",
+      "n": 3,
+      "correct": 2,
+      "concordance": 0.6666666666666666,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "FANCA",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "CDKN1B",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "PKP2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "GATA5",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 2
+      }
+    },
+    {
+      "key": "CDKN1C",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "RNF14",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "AMT",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "HMGCL",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "GAMT",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "GRIN2B",
+      "n": 2,
+      "correct": 0,
+      "concordance": 0.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ABCA4",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "SKIC3",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "CLCN1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "CREBBP",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "MKS1",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "ACADM",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "SUCLA2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "APC",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "SYDE2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "KCNJ1",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "PRKCSH",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "MTRR",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 2
+      }
+    },
+    {
+      "key": "KDM6A",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "RBCK1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "MYH11",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "NLRC4",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "MACF1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "APP",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 2
+      }
+    },
+    {
+      "key": "COL4A1",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "NECTIN4",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 2
+      }
+    },
+    {
+      "key": "CPAP",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "EOGT",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 2
+      }
+    },
+    {
+      "key": "FANCG",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "HSD3B7",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ANO5",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "SLC37A4",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "STAT1",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "NHS",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "STAT3",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "GPR143",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "BSCL2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "NPHS2",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "SGCE",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "ATM",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "COCH",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "GLI3",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "MYO1E",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "CAMK2B",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "DOLK",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "ATIC",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "PHYH",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "AQP2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "SPTB",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "MAGI2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "TBC1D24",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "LAMA1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "SOS1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "EVC",
+      "n": 2,
+      "correct": 0,
+      "concordance": 0.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "PHEX",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "GAN",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "ARSB",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "AGXT",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "DNAJC13",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "ABCC8",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "EDA",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 2
+      }
+    },
+    {
+      "key": "ABCC9",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ATP7A",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 2
+      }
+    },
+    {
+      "key": "P2RY12",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "CHD2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "AKAP9",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 2
+      }
+    },
+    {
+      "key": "B9D1",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "PSAP",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "BBS2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 1,
+        "Likely Pathogenic": 1
+      }
+    },
+    {
+      "key": "NPHS1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ECHS1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "MRE11",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "SMARCC2",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "PARN",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "SMAD2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "VPS4A",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "AEBP1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "SLC25A13",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "ROBO1",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "TRIOBP",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "FANCF",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "MAG",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Benign": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "MAX",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Likely Benign": 1,
+        "Pathogenic": 1
+      }
+    },
+    {
+      "key": "MED25",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "ETFDH",
+      "n": 2,
+      "correct": 1,
+      "concordance": 0.5,
+      "tier_distribution": {
+        "Likely Pathogenic": 1,
+        "Uncertain Significance": 1
+      }
+    },
+    {
+      "key": "BMPR2",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "MPDZ",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Uncertain Significance": 1,
+        "Pathogenic": 1
+      }
+    }
+  ],
+  "per_variant_type": [
+    {
+      "key": "missense_or_silent",
+      "n": 658,
+      "correct": 547,
+      "concordance": 0.831306990881459,
+      "tier_distribution": {
+        "Likely Benign": 151,
+        "Uncertain Significance": 185,
+        "Benign": 105,
+        "Likely Pathogenic": 101,
+        "Pathogenic": 116
+      }
+    },
+    {
+      "key": "splice_region",
+      "n": 182,
+      "correct": 177,
+      "concordance": 0.9725274725274725,
+      "tier_distribution": {
+        "Likely Pathogenic": 69,
+        "Likely Benign": 38,
+        "Benign": 60,
+        "Pathogenic": 13,
+        "Uncertain Significance": 2
+      }
+    },
+    {
+      "key": "inframe_del",
+      "n": 69,
+      "correct": 64,
+      "concordance": 0.927536231884058,
+      "tier_distribution": {
+        "Pathogenic": 46,
+        "Likely Pathogenic": 20,
+        "Likely Benign": 1,
+        "Uncertain Significance": 1,
+        "Benign": 1
+      }
+    },
+    {
+      "key": "other",
+      "n": 51,
+      "correct": 48,
+      "concordance": 0.9411764705882353,
+      "tier_distribution": {
+        "Benign": 26,
+        "Likely Pathogenic": 4,
+        "Likely Benign": 8,
+        "Uncertain Significance": 11,
+        "Pathogenic": 2
+      }
+    },
+    {
+      "key": "inframe_ins",
+      "n": 31,
+      "correct": 30,
+      "concordance": 0.967741935483871,
+      "tier_distribution": {
+        "Pathogenic": 22,
+        "Likely Pathogenic": 6,
+        "Benign": 2,
+        "Likely Benign": 1
+      }
+    },
+    {
+      "key": "synonymous",
+      "n": 2,
+      "correct": 2,
+      "concordance": 1.0,
+      "tier_distribution": {
+        "Benign": 2
+      }
+    }
+  ],
+  "per_review_status": [
+    {
+      "key": "?",
+      "n": 993,
+      "correct": 868,
+      "concordance": 0.8741188318227593,
+      "tier_distribution": {
+        "Likely Benign": 199,
+        "Uncertain Significance": 199,
+        "Pathogenic": 199,
+        "Benign": 196,
+        "Likely Pathogenic": 200
+      }
+    }
+  ],
+  "per_gene_per_tier": []
+}
scripts/per_gene_breakdown.py ADDED
@@ -0,0 +1,186 @@
+"""Slice a validation report by gene and other axes for stratified analysis.
+
+Reads docs/clinical_validation_results_1000.json (or any structurally-
+identical report) and emits per-gene, per-variant-type, and per-tier
+breakdowns. This is what a reviewer / lab director will ask for after
+seeing the headline number — "great, but how does it do on BRCA1?"
+
+Usage
+-----
+    python -m scripts.per_gene_breakdown \\
+        --in docs/clinical_validation_results_1000.json \\
+        --out docs/per_gene_breakdown_1000.json \\
+        --top 25
+
+Outputs both:
+  - A JSON file with the full per-gene/per-type/per-tier slice
+  - A human-readable table on stdout for quick inspection
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any
+
+
+def _class(label: str) -> str:
+    """Collapse 5-tier to P/VUS/B class for adjacent-tier metric."""
+    if label in ("Pathogenic", "Likely Pathogenic"):
+        return "P"
+    if label in ("Benign", "Likely Benign"):
+        return "B"
+    return "VUS"
+
+
+def _variant_type(hgvs: str) -> str:
+    """Heuristic categorization from the HGVS string."""
+    h = hgvs.lower()
+    if "del" in h and "_" in h:
+        return "inframe_del"
+    if "dup" in h or ("ins" in h and "_" in h):
+        return "inframe_ins"
+    if h.endswith("=") or "p.=" in h:
+        return "synonymous"
+    if re.search(r"c\.\d+[+-]\d+", h):
+        return "splice_region"
+    if re.search(r"c\.\d+[acgt]>[acgt]", h):
+        return "missense_or_silent"
+    return "other"
+
+
+def per_axis_table(
+    results: list[dict[str, Any]],
+    key_fn,
+    min_n: int = 2,
+) -> list[dict[str, Any]]:
+    """Group results by key_fn(row), compute class-level concordance per group."""
+    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    for r in results:
+        if r.get("got") == "ERROR":
+            continue
+        groups[key_fn(r)].append(r)
+
+    rows: list[dict[str, Any]] = []
+    for key, items in groups.items():
+        if len(items) < min_n:
+            continue
+        correct = sum(1 for r in items if _class(r["expected"]) == _class(r["got"]))
+        # Per-class breakdown within the group
+        per_tier = Counter(r["expected"] for r in items)
+        rows.append({
+            "key": key,
+            "n": len(items),
+            "correct": correct,
+            "concordance": correct / len(items) if items else 0.0,
+            "tier_distribution": dict(per_tier),
+        })
+    return sorted(rows, key=lambda r: -r["n"])
+
+
+def per_gene_per_tier_table(results: list[dict[str, Any]], min_n: int = 5) -> list[dict[str, Any]]:
+    """For each (gene, expected_tier) combo, report concordance. Lets the
+    operator see *"how does BRCA1 do on its pathogenic variants specifically?"*"""
+    groups: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
+    for r in results:
+        if r.get("got") == "ERROR":
+            continue
+        groups[(r.get("gene") or "?", r["expected"])].append(r)
+    rows = []
+    for (gene, tier), items in groups.items():
+        if len(items) < min_n:
+            continue
+        correct = sum(1 for r in items if _class(r["expected"]) == _class(r["got"]))
+        rows.append({
+            "gene": gene,
+            "tier": tier,
+            "n": len(items),
+            "correct": correct,
+            "concordance": correct / len(items),
+        })
+    return sorted(rows, key=lambda r: (r["gene"], r["tier"]))
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--in",
+        dest="in_path",
+        type=Path,
+        default=Path("docs/clinical_validation_results_1000.json"),
+    )
+    parser.add_argument(
+        "--out",
+        type=Path,
+        default=Path("docs/per_gene_breakdown_1000.json"),
+    )
+    parser.add_argument(
+        "--top", type=int, default=25,
+        help="How many top-N rows to print per table (full output goes to JSON).",
+    )
+    parser.add_argument(
+        "--min-n", type=int, default=2,
+        help="Minimum variants per group to include (avoids noise from 1-variant groups).",
+    )
+    args = parser.parse_args()
+
+    data = json.loads(args.in_path.read_text())
+    results = data.get("results", [])
+    print(f"Loaded {len(results)} results from {args.in_path}")
+    print(f"Headline concordance: {data.get('concordance', 0):.1%}")
+    print()
+
+    by_gene = per_axis_table(results, lambda r: r.get("gene") or "?", min_n=args.min_n)
+    by_type = per_axis_table(results, lambda r: _variant_type(r.get("hgvs") or ""))
+    by_review = per_axis_table(
+        results, lambda r: r.get("review_status") or "?",
+    )
+
+    # --- print top genes ---
+    print(f"Per-gene concordance (top {args.top} by variant count):")
+    print(f" {'gene':12s} {'n':>4s} {'correct':>8s} {'concordance':>13s}")
+    for row in by_gene[: args.top]:
+        marker = "!" if row["concordance"] < 0.80 else " "
+        print(f"{marker} {row['key']:12s} {row['n']:4d} {row['correct']:8d} "
+              f"{row['concordance']:13.1%}")
+    weak = [r for r in by_gene if r["concordance"] < 0.80 and r["n"] >= 5]
+    if weak:
+        print(f"\nGenes with concordance < 80% (n ≥ 5) — investigate first:")
+        for row in weak:
+            print(f" {row['key']:12s} {row['n']:4d} variants {row['concordance']:6.1%}")
+
+    print()
+    print("Per-variant-type concordance:")
+    print(f" {'type':22s} {'n':>4s} {'correct':>8s} {'concordance':>13s}")
+    for row in by_type:
+        print(f" {row['key']:22s} {row['n']:4d} {row['correct']:8d} "
+              f"{row['concordance']:13.1%}")
+
+    print()
+    print("Per-review-status concordance:")
+    print(f" {'review':55s} {'n':>4s} {'concordance':>13s}")
+    for row in by_review:
+        print(f" {row['key']:55s} {row['n']:4d} {row['concordance']:13.1%}")
+
+    # --- write full JSON ---
+    out_payload = {
+        "source_report": str(args.in_path),
+        "headline_concordance": data.get("concordance"),
+        "total_scored": data.get("total_scored"),
+        "per_gene": by_gene,
+        "per_variant_type": by_type,
+        "per_review_status": by_review,
+        "per_gene_per_tier": per_gene_per_tier_table(results, min_n=5),
+    }
+    args.out.parent.mkdir(parents=True, exist_ok=True)
+    args.out.write_text(json.dumps(out_payload, indent=2) + "\n")
+    print(f"\nFull breakdown written to {args.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
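
The variant-type heuristic in the script is order-dependent, so it helps to see what it returns on a few inputs. The sketch below copies the function body from the diff (renamed `variant_type` here) and runs it on HGVS strings that are hypothetical examples, not rows from the validation report. Whether the report's hgvs field carries a RefSeq prefix is also an assumption.

```python
import re


def variant_type(hgvs: str) -> str:
    """Copy of the script's heuristic, reproduced for illustration."""
    h = hgvs.lower()
    if "del" in h and "_" in h:
        return "inframe_del"
    if "dup" in h or ("ins" in h and "_" in h):
        return "inframe_ins"
    if h.endswith("=") or "p.=" in h:
        return "synonymous"
    if re.search(r"c\.\d+[+-]\d+", h):
        return "splice_region"
    if re.search(r"c\.\d+[acgt]>[acgt]", h):
        return "missense_or_silent"
    return "other"


# Hypothetical inputs chosen to exercise each branch.
EXAMPLES = [
    "c.1208G>A",                  # plain coding SNV
    "c.123+1G>A",                 # intronic offset
    "c.100_102del",               # range deletion
    "c.35dup",                    # single-base duplication
    "c.76A>T (p.Arg26=)",         # silent SNV written with a protein suffix
    "NM_000207.3(INS):c.94G>A",   # RefSeq prefix plus a gene symbol containing "ins"
]

for hgvs in EXAMPLES:
    print(f"{hgvs:28s} -> {variant_type(hgvs)}")
```

Two behaviours stand out. A silent SNV written as `c.76A>T (p.Arg26=)` lands in `missense_or_silent`, because the string neither ends in `=` nor contains `p.=`. And if the input carries a RefSeq prefix such as `NM_...`, the `"_" in h` test is always true, so a gene symbol containing "ins" routes an SNV to `inframe_ins`. Both are properties of the heuristic on these hypothetical strings, and whether they affect the reported buckets depends on the actual format of the hgvs field.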