phanerozoic/dna-origin-classifier

@@ -10,7 +10,7 @@ detector cannot distinguish from genuine, because the two have the same expected
 Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
 and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
-both the detector order k and the adversary order m gives the boundary, measured on benchmark v2:
 | detector | adversary m=0 | m=1 | m=2 | m=3 | m=4 | m=5 |
 |---|---|---|---|---|---|---|
@@ -23,9 +23,10 @@ both the detector order k and the adversary order m gives the boundary, measured
 ## Result
 Each detector collapses to chance exactly when the adversary reaches its order: the k=2 detector
-breaks at m=1, k=4 at m=3, and k=6 at m=5. The staircase is the sufficient-statistic prediction
-made visible. The hexamer detector this model uses is blind to an adversary who matches the
-order-5 composition of human DNA (AUROC 0.53 at m=5).
 ## The neural model is not evaded
@@ -41,11 +42,11 @@ separates it from real human across every order, exactly where composition fails
 | 6 | 0.52 | 1.00 |
 | 7 | 0.52 | 1.00 |
-The order-5-matched construct is invisible to the hexamer detector (0.53) and obvious to the model
-(1.00). Even an order-7 match, reproducing every 8-mer frequency of human DNA, is caught at 0.997,
-because the model reads long-range structure, codon-pair grammar, gene organization, and motif
-context, that no fixed-order composition encodes. The model's value here is precisely adversarial
-robustness against the evasion composition cannot resist.
 ## Implication for biosecurity screening
@@ -57,7 +58,7 @@ adversary must clear (the k=6 detector forces an order-5 match, which constrains
 than an order-1 match), but it never closes the gap, and higher k costs data and invites
 overfitting. Detecting an order-(k-1)-matched adversary requires signal that is not in global
 composition at all: per-position, context-dependent modeling of the kind a neural sequence model
-provides, which is where composition methods stop and learned models earn their place.
 This boundary is a property of the method, not of any particular trained weights, and it applies
 equally to other composition-based detectors.

 Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
 and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
+both the detector order k and the adversary order m gives the boundary:
 | detector | adversary m=0 | m=1 | m=2 | m=3 | m=4 | m=5 |
 |---|---|---|---|---|---|---|
 ## Result
 Each detector collapses to chance exactly when the adversary reaches its order: the k=2 detector
+breaks at m=1, k=4 at m=3, and k=6 at m=5, matching the sufficient-statistic account: a detector
+reading k-mer counts cannot separate sequence whose order-(k-1) statistics have been reproduced.
+The hexamer detector this model uses is at chance against an adversary that matches the order-5
+composition of human DNA (AUROC 0.53 at m=5).
 ## The neural model is not evaded
 | 6 | 0.52 | 1.00 |
 | 7 | 0.52 | 1.00 |
+At order 5 the hexamer detector is at chance (0.53) while the model separates the same sequences at
+1.00. At order 7, which reproduces every 8-mer frequency of human DNA, the model still scores 0.997,
+because it reads long-range structure (codon-pair grammar, gene organization, motif context) that no
+fixed-order composition encodes. Where composition loses discrimination at high adversary order, the
+model retains it.
 ## Implication for biosecurity screening
 than an order-1 match), but it never closes the gap, and higher k costs data and invites
 overfitting. Detecting an order-(k-1)-matched adversary requires signal that is not in global
 composition at all: per-position, context-dependent modeling of the kind a neural sequence model
+provides, which is where composition methods stop and a learned model is required.
 This boundary is a property of the method, not of any particular trained weights, and it applies
 equally to other composition-based detectors.
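
The order-m construction described in the changed text can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the function names (`fit_markov`, `sample_markov`, `kmer_freqs`, `l1_distance`) are hypothetical, and the "real" sequence is a toy placeholder built from four arbitrary codons, not human coding sequence. It shows only the mechanism: an order-(k-1) Markov sample reproduces the k-mer frequencies of its source far more closely than an order-0 sample does.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def fit_markov(seq, m):
    """Count next-base occurrences for every length-m context in seq."""
    table = defaultdict(Counter)
    for i in range(len(seq) - m):
        table[seq[i:i + m]][seq[i + m]] += 1
    return table

def sample_markov(table, m, length, rng, seed_context=""):
    """Generate `length` bases from a fitted order-m transition table."""
    out = list(seed_context)
    while len(out) < length:
        ctx = "".join(out[len(out) - m:]) if m else ""
        counts = table.get(ctx)
        if not counts:
            # Unseen context: fall back to a uniform draw (an assumption of this sketch).
            out.append(rng.choice(ALPHABET))
            continue
        bases = list(counts)
        weights = [counts[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

def kmer_freqs(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def l1_distance(p, q):
    """L1 distance between two sparse frequency dictionaries."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))

rng = random.Random(0)

# Toy stand-in for real sequence: a random concatenation of four placeholder codons.
toy_real = "".join(rng.choice(["ATG", "GCC", "GAA", "CTG"]) for _ in range(7000))

k = 3  # detector order (3-mers here, to keep the toy small)
real_freqs = kmer_freqs(toy_real, k)

# Adversary matched at order k-1 versus an order-0 (base composition only) adversary.
m = k - 1
matched = sample_markov(fit_markov(toy_real, m), m, len(toy_real), rng, toy_real[:m])
unmatched = sample_markov(fit_markov(toy_real, 0), 0, len(toy_real), rng)

d_matched = l1_distance(real_freqs, kmer_freqs(matched, k))
d_unmatched = l1_distance(real_freqs, kmer_freqs(unmatched, k))
print(f"k={k}: L1 to real, order-{m} adversary = {d_matched:.3f}, "
      f"order-0 adversary = {d_unmatched:.3f}")
```

A detector that only sees k-mer frequencies has little to work with once `d_matched` is near sampling noise, which is the collapse the table reports at m = k-1. The sketch stops short of training a classifier or computing AUROC; it illustrates the composition-matching step only.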