Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7
b8f10aa verified

# Adversarial robustness: the order-(k-1) evasion boundary

Composition-based DNA classifiers, including this one and other homology-free engineered-sequence
detectors, reduce to a k-mer frequency statistic. The k-mer counts such a detector reads are the
sufficient statistic for the transition probabilities of an order-(k-1) Markov model, which has a
direct security consequence: an adversary who reproduces the order-(k-1) composition of the target
class produces sequence the detector cannot separate from genuine sequence, because the two have
the same expected k-mer spectrum.
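
The sufficiency claim can be written out directly. The notation here is an assumption introduced for illustration: $x_{1:n}$ is a sequence, $\theta$ the transition probabilities of an order-(k-1) Markov model, and $N_w(x)$ the number of times the k-mer $w$ occurs in $x$.

$$
P(x_{1:n} \mid \theta)
= P(x_{1:k-1}) \prod_{i=k}^{n} \theta(x_i \mid x_{i-k+1:i-1})
= P(x_{1:k-1}) \prod_{w \in \{A,C,G,T\}^k} \theta(w_k \mid w_{1:k-1})^{N_w(x)}
$$

Apart from the first k-1 bases, the likelihood depends on $x$ only through the counts $N_w(x)$, so two sequences drawn from the same order-(k-1) model have the same expected k-mer counts.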

## Test

Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
both the detector word length k and the adversary order m gives the boundary:

| detector | m=0 | m=1 | m=2 | m=3 | m=4 | m=5 | m=6 | m=7 |
|---|---|---|---|---|---|---|---|---|
| k=4 | 1.00 | 0.97 | 0.87 | 0.51 | 0.50 | 0.50 | 0.50 | 0.50 |
| k=6 | 1.00 | 0.98 | 0.92 | 0.79 | 0.66 | 0.52 | 0.58 | 0.52 |
| **k=8 (this model)** | 1.00 | 0.97 | 0.93 | 0.88 | 0.82 | 0.80 | 0.72 | 0.55 |

(AUROC, real human vs order-m-matched synthetic.)
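
The test procedure can be sketched roughly as follows. This is a minimal illustration and not the code that produced the table: the function names (`fit_markov`, `sample_markov`, `kmer_frequencies`, `auroc`) are hypothetical, the uniform fallback for unseen contexts is an assumption, and the toy sequences stand in for real human coding sequence.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seqs, m):
    """Count next-base occurrences after every length-m context."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(m, len(s)):
            counts[s[i - m:i]][s[i]] += 1
    return counts


def sample_markov(counts, m, length, rng):
    """Draw one synthetic sequence from the fitted order-m model."""
    contexts = list(counts)
    weights = [sum(counts[c].values()) for c in contexts]
    # Start from an observed context, chosen in proportion to its frequency.
    out = list(rng.choices(contexts, weights=weights)[0])
    while len(out) < length:
        nxt = counts.get("".join(out[len(out) - m:]))
        if nxt:
            out.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
        else:
            # Assumption: a context never seen with a successor gets a uniform base.
            out.append(rng.choice("ACGT"))
    return "".join(out[:length])


def kmer_frequencies(seq, k):
    """Normalized k-mer spectrum, the feature a composition detector reads."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy stand-in data; a real run would use human coding sequence.
rng = random.Random(0)
real = ["ACGTACGTACGT", "ACGTTTACGTAC"]
model = fit_markov(real, m=2)
synthetic = [sample_markov(model, 2, len(s), rng) for s in real]
```

A detector under test would score `real` and `synthetic` from their `kmer_frequencies` output, and `auroc` applied to the two score lists gives one cell of the table.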

## Result

Each detector falls to near chance (AUROC 0.51 to 0.55) once the adversary order reaches k-1:
k=4 breaks at m=3, k=6 at m=5, and k=8 at m=7. This model uses 8-mers, so evading it by
composition matching requires reproducing the order-7 statistics of the target class, which fix
every expected 8-mer frequency. Lower-order forgeries, including anything that matches only
hexamer or shorter composition, are caught, though the margin shrinks as m approaches 7 (0.72 at
m=6). Longer words push the bar higher at the cost of more parameters and data.
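
To make the cost side concrete, the sketch below counts features and adversary parameters as k grows. It assumes one detector feature per distinct k-mer (no reverse-complement collapsing, which this document does not specify) and three free transition probabilities per context in the adversary's Markov model; both function names are hypothetical.

```python
def kmer_feature_count(k: int) -> int:
    """Number of distinct DNA k-mers, assuming one feature per k-mer."""
    return 4 ** k


def markov_free_parameters(m: int) -> int:
    """Free transition probabilities in an order-m model: 4**m contexts, 3 each."""
    return 3 * 4 ** m


for k in (4, 6, 8):
    print(k, kmer_feature_count(k), markov_free_parameters(k - 1))
```

Under these assumptions, going from k=6 to k=8 multiplies both the detector's feature space and the statistics the adversary must estimate by 16.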

## The neural model is not evaded

Scoring the same order-m-matched synthetic human sequence with Carbon-8B (zero-shot per-base
likelihood) separates it from real human sequence at every order tested, including the order
where this model reaches chance:

| adversary order m | this model (k=8) | Carbon-8B |
|---|---|---|
| 5 | 0.80 | 1.00 |
| 6 | 0.72 | 1.00 |
| 7 | 0.55 | 1.00 |
| 8 | 0.54 | 1.00 |

At order 7 the k=8 detector is near chance (0.55) while Carbon-8B holds at 1.00, because the
neural model reads long-range structure (codon-pair grammar, gene organization, motif context)
that no fixed-order composition encodes. Where composition runs out at high adversary order, the
neural model still separates.
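
As a rough illustration of what "zero-shot per-base likelihood" scoring means, the sketch below averages the log-probability a model assigns to each base given its prefix. The `next_base_probs` callable is a hypothetical interface, not Carbon-8B's actual API, and `uniform_model` is a toy stand-in so the example runs on its own.

```python
import math
from typing import Callable, Dict

# Hypothetical interface: given the prefix seen so far, return P(next base).
NextBaseProbs = Callable[[str], Dict[str, float]]


def mean_log_likelihood(seq: str, next_base_probs: NextBaseProbs) -> float:
    """Average per-base log-likelihood of seq under an autoregressive model."""
    total = 0.0
    for i, base in enumerate(seq):
        total += math.log(next_base_probs(seq[:i])[base])
    return total / len(seq)


def uniform_model(prefix: str) -> Dict[str, float]:
    """Toy stand-in that ignores context; a real model would condition on prefix."""
    return {b: 0.25 for b in "ACGT"}


score = mean_log_likelihood("ACGTACGT", uniform_model)
```

Because the score conditions every base on its full prefix, it can respond to structure beyond any fixed window, which is the property the comparison above relies on.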

## Implication for biosecurity screening

Homology-free, composition-based screening has an inherent evasion boundary. It catches naive
recoding and composition that drifts from the target, but by construction it cannot flag a
construct matched to the order-(k-1) statistics of a natural class. Raising k raises the bar the
adversary must clear; this model's 8-mers force an order-7 match. Detecting an order-(k-1)-matched
adversary requires signal that is not in global composition at all: per-position, context-dependent
modeling of the kind a neural sequence model provides. This boundary is a property of the method,
and it applies equally to other composition-based detectors.