phanerozoic committed
Commit fdfd18e · verified · 1 Parent(s): 8e667db

Remove version label; flatten phrasing

Files changed (1)
  1. ADVERSARIAL.md +11 -10
ADVERSARIAL.md CHANGED
@@ -10,7 +10,7 @@ detector cannot distinguish from genuine, because the two have the same expected
 
  Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
  and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
- both the detector order k and the adversary order m gives the boundary, measured on benchmark v2:
 
  | detector | adversary m=0 | m=1 | m=2 | m=3 | m=4 | m=5 |
  |---|---|---|---|---|---|---|
@@ -23,9 +23,10 @@ both the detector order k and the adversary order m gives the boundary, measured
  ## Result
 
  Each detector collapses to chance exactly when the adversary reaches its order: the k=2 detector
- breaks at m=1, k=4 at m=3, and k=6 at m=5. The staircase is the sufficient-statistic prediction
- made visible. The hexamer detector this model uses is blind to an adversary who matches the
- order-5 composition of human DNA (AUROC 0.53 at m=5).
 
  ## The neural model is not evaded
 
@@ -41,11 +42,11 @@ separates it from real human across every order, exactly where composition fails
  | 6 | 0.52 | 1.00 |
  | 7 | 0.52 | 1.00 |
 
- The order-5-matched construct is invisible to the hexamer detector (0.53) and obvious to the model
- (1.00). Even an order-7 match, reproducing every 8-mer frequency of human DNA, is caught at 0.997,
- because the model reads long-range structure, codon-pair grammar, gene organization, and motif
- context, that no fixed-order composition encodes. The model's value here is precisely adversarial
- robustness against the evasion composition cannot resist.
 
  ## Implication for biosecurity screening
 
@@ -57,7 +58,7 @@ adversary must clear (the k=6 detector forces an order-5 match, which constrains
  than an order-1 match), but it never closes the gap, and higher k costs data and invites
  overfitting. Detecting an order-(k-1)-matched adversary requires signal that is not in global
  composition at all: per-position, context-dependent modeling of the kind a neural sequence model
- provides, which is where composition methods stop and learned models earn their place.
 
  This boundary is a property of the method, not of any particular trained weights, and it applies
  equally to other composition-based detectors.
 
 
  Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
  and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
+ both the detector order k and the adversary order m gives the boundary:
 
  | detector | adversary m=0 | m=1 | m=2 | m=3 | m=4 | m=5 |
  |---|---|---|---|---|---|---|
 
  ## Result
 
  Each detector collapses to chance exactly when the adversary reaches its order: the k=2 detector
+ breaks at m=1, k=4 at m=3, and k=6 at m=5, matching the sufficient-statistic account: a detector
+ reading k-mer counts cannot separate sequence whose order-(k-1) statistics have been reproduced.
+ The hexamer detector this model uses is at chance against an adversary that matches the order-5
+ composition of human DNA (AUROC 0.53 at m=5).
 
  ## The neural model is not evaded
 
 
  | 6 | 0.52 | 1.00 |
  | 7 | 0.52 | 1.00 |
 
+ At order 5 the hexamer detector is at chance (0.53) while the model separates the same sequences at
+ 1.00. At order 7, which reproduces every 8-mer frequency of human DNA, the model still scores 0.997,
+ because it reads long-range structure (codon-pair grammar, gene organization, motif context) that no
+ fixed-order composition encodes. Where composition loses discrimination at high adversary order, the
+ model retains it.
 
  ## Implication for biosecurity screening
 
 
  than an order-1 match), but it never closes the gap, and higher k costs data and invites
  overfitting. Detecting an order-(k-1)-matched adversary requires signal that is not in global
  composition at all: per-position, context-dependent modeling of the kind a neural sequence model
+ provides, which is where composition methods stop and a learned model is required.
 
  This boundary is a property of the method, not of any particular trained weights, and it applies
  equally to other composition-based detectors.
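
The sufficient-statistic boundary described in ADVERSARIAL.md can be illustrated with a minimal Python sketch. It is not the benchmark code from this repository: the helper names (`fit_markov`, `sample_markov`, `kmer_freqs`, `tv_distance`) are hypothetical, the "real" source is a toy sequence (a random 3-mer followed by its copy) standing in for human coding sequence, and total-variation distance between k-mer frequency tables is used as a simple stand-in for a detector's AUROC. Under those assumptions, an order-2 adversary should reproduce k-mer frequencies up to k=3, while the k=4 frequencies still differ.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq, m):
    """Count next-base occurrences after every length-m context in seq."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - m):
        counts[seq[i:i + m]][seq[i + m]] += 1
    return counts


def sample_markov(counts, m, length, rng):
    """Generate `length` bases from fitted order-m transition counts."""
    contexts = list(counts)
    out = list(rng.choice(contexts))          # start from a context seen in training
    while len(out) < length:
        ctx = "".join(out[-m:]) if m else ""
        nxt = counts.get(ctx)
        if nxt is None:                        # context with no observed successor
            out.extend(rng.choice(contexts))   # restart from a seen context
            continue
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


def kmer_freqs(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def tv_distance(p, q):
    """Total-variation distance between two k-mer frequency tables."""
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))


rng = random.Random(0)

# Toy "real" source (an assumption, NOT human DNA): each block is a random
# 3-mer followed by its copy, so the only structure is a lag-3 dependence.
blocks = ("".join(rng.choice("ACGT") for _ in range(3)) for _ in range(10000))
real = "".join(b + b for b in blocks)

m = 2  # adversary order
synthetic = sample_markov(fit_markov(real, m), m, len(real), rng)

# Compare k-mer composition of real vs. order-m synthetic for several k.
tv = {k: tv_distance(kmer_freqs(real, k), kmer_freqs(synthetic, k)) for k in (1, 2, 3, 4)}
for k, d in tv.items():
    print(f"k={k}: TV distance real vs order-{m} synthetic = {d:.3f}")
```

In this toy setting the distances for k ≤ m+1 reflect only sampling noise, while k = m+2 exposes the lag-3 copy that the order-2 model cannot encode. Estimates at larger k need longer sequences, because the number of distinct k-mers grows as 4^k and sampling noise grows with it.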