Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7


	# Adversarial robustness: the order-(k-1) evasion boundary

	Composition-based DNA classifiers, including this one and other homology-free engineered-sequence
	detectors, reduce to a k-mer frequency statistic. K-mer counts are the sufficient statistic of an
	order-(k-1) Markov model, so a detector that reads only k-mer counts sees nothing beyond
	order-(k-1) composition. This has a direct security consequence: an adversary who reproduces the
	order-(k-1) composition of the target class produces sequence the detector cannot separate from
	genuine sequence, because the two have the same expected k-mer spectrum.
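The sufficient-statistic claim can be made explicit with the standard factorization of an order-(k-1) Markov likelihood. This is a textbook identity, not something taken from this repository's code. The symbols $x$, $n$, $p$, and $n_w$ are notation introduced here for illustration. For a sequence $x = x_1 \dots x_n$ with transition probabilities $p(b \mid c)$ over length-(k-1) contexts $c$:

$$
\log P(x) \;=\; \log P(x_1 \dots x_{k-1}) \;+\; \sum_{i=k}^{n} \log p(x_i \mid x_{i-k+1} \dots x_{i-1})
\;=\; \log P(x_1 \dots x_{k-1}) \;+\; \sum_{w \in \{A,C,G,T\}^k} n_w(x)\, \log p(w_k \mid w_1 \dots w_{k-1})
$$

Here $n_w(x)$ is the number of occurrences of the k-mer $w$ in $x$. Apart from the initial context, the likelihood depends on $x$ only through its k-mer counts, which is why a chain fitted at order k-1 reproduces, in expectation, everything a k-mer detector can measure.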

	## Test

	Fit an order-m Markov model to real human coding sequence, generate synthetic sequence from it,
	and measure whether a k-mer detector separates real human from the order-m synthetic. Sweeping
	both the detector word length k and the adversary order m gives the boundary:
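The adversary side of this test can be sketched as follows. This is a minimal illustration in plain Python, assuming maximum-likelihood transition counts and a uniformly chosen starting context. The function names `fit_markov` and `sample_markov` are hypothetical and are not taken from this repository.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seqs, m):
    """Count which base follows each length-m context in the training sequences."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - m):
            counts[s[i:i + m]][s[i + m]] += 1
    return counts


def sample_markov(counts, m, length, rng):
    """Generate `length` bases from the fitted order-m chain."""
    contexts = list(counts)
    # Assumption: start from a uniformly chosen observed context.
    out = list(rng.choice(contexts))
    while len(out) < length:
        ctx = "".join(out[len(out) - m:])
        nxt = counts.get(ctx)
        if not nxt:
            # Dead end: this context was only seen at the very end of a
            # training sequence, so restart from an observed context.
            out.extend(rng.choice(contexts))
            continue
        bases, weights = zip(*nxt.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


if __name__ == "__main__":
    # Toy stand-in for real human coding sequence.
    train = ["ATGGCCAAGCTGACCGAAGAGCTGAAGTAA" * 20]
    model = fit_markov(train, m=3)
    print(sample_markov(model, m=3, length=60, rng=random.Random(0)))
```

Sweeping `m` from 0 upward and regenerating the synthetic set at each order gives the columns of the table that follows. The training data in the usage example is a toy placeholder.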

	| Detector | m=0 | m=1 | m=2 | m=3 | m=4 | m=5 | m=6 | m=7 |
	|---|---|---|---|---|---|---|---|---|
	| k=4 | 1.00 | 0.97 | 0.87 | 0.51 | 0.50 | 0.50 | 0.50 | 0.50 |
	| k=6 | 1.00 | 0.98 | 0.92 | 0.79 | 0.66 | 0.52 | 0.58 | 0.52 |
	| k=8 (this model) | 1.00 | 0.97 | 0.93 | 0.88 | 0.82 | 0.80 | 0.72 | 0.55 |

	(Values are AUROC for real human vs order-m-matched synthetic; 0.50 is chance.)
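The measurement itself can be sketched as below. The source does not describe the detector's internals, so the log-odds k-mer scorer here is a stand-in assumption, and `kmer_counts`, `fit_log_odds`, `score`, and `auroc` are hypothetical names. AUROC is computed directly from its definition as the probability that a real sequence outscores a synthetic one.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fit_log_odds(real_seqs, synth_seqs, k, pseudo=1.0):
    """Per-k-mer log-odds weights (real vs synthetic) with additive smoothing."""
    real, synth = Counter(), Counter()
    for s in real_seqs:
        real.update(kmer_counts(s, k))
    for s in synth_seqs:
        synth.update(kmer_counts(s, k))
    vocab = set(real) | set(synth)
    real_total = sum(real.values()) + pseudo * len(vocab)
    synth_total = sum(synth.values()) + pseudo * len(vocab)
    return {
        w: math.log((real[w] + pseudo) / real_total)
        - math.log((synth[w] + pseudo) / synth_total)
        for w in vocab
    }


def score(seq, weights, k):
    """Mean log-odds per k-mer; higher means more 'real-like'."""
    counts = kmer_counts(seq, k)
    n = max(sum(counts.values()), 1)
    return sum(weights.get(w, 0.0) * c for w, c in counts.items()) / n


def auroc(pos_scores, neg_scores):
    """P(random positive outscores random negative); ties count as one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

In use, the weights would be fitted on a training split and `auroc` evaluated on held-out real and synthetic sequences, so that a value near 0.50 reflects genuine inseparability and not overfitting.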

	## Result

	Each detector falls to near chance once the adversary order reaches k-1: k=4 at m=3 (0.51), k=6 at
	m=5 (0.52), and k=8 at m=7 (0.55). This model uses 8-mers, so evading it by composition matching
	requires reproducing the order-7 statistics of the target class, which fixes every expected 8-mer
	frequency. Lower-order forgeries, including anything that matches only hexamer or shorter
	composition, are still separated (AUROC 0.80 or higher for m ≤ 5). Longer words push the bar higher
	at the cost of more parameters and data.
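The cost side of that trade-off is simple arithmetic over the four-letter DNA alphabet. The sketch below counts only the possible k-mers and the free parameters of an order-m chain. It says nothing about how many parameters this model actually stores, and the function names are hypothetical.

```python
def kmer_feature_count(k):
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


def markov_free_params(m):
    """Free transition probabilities in an order-m chain: 3 per length-m context."""
    return 3 * 4 ** m


if __name__ == "__main__":
    for k in (4, 6, 8):
        print(f"k={k}: {kmer_feature_count(k)} k-mers; "
              f"order-{k - 1} adversary fits {markov_free_params(k - 1)} parameters")
```

Each step of 2 in k multiplies both counts by 16, so the detector's feature space and the adversary's fitting problem grow together.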

	## The neural model is not evaded

	Scoring the same order-m-matched synthetic human with Carbon-8B (zero-shot per-base likelihood)
	separates it from real human at every order, including the order where this model reaches chance:

	| Adversary order m | This model (k=8) | Carbon-8B |
	|---|---|---|
	| 5 | 0.80 | 1.00 |
	| 6 | 0.72 | 1.00 |
	| 7 | 0.55 | 1.00 |
	| 8 | 0.54 | 1.00 |

	At order 7 the k=8 detector is near chance (0.55) while Carbon-8B holds at 1.00, because Carbon-8B
	reads long-range structure (codon-pair grammar, gene organization, motif context) that no fixed-order
	composition encodes. Where composition runs out at high adversary order, the neural model still
	separates.

	## Implication for biosecurity screening

	Homology-free, composition-based screening has an inherent evasion boundary. It catches naive
	recoding and composition that drifts from the target, but by construction it cannot flag a
	construct matched to the order-(k-1) statistics of a natural class. Raising k raises the bar the
	adversary must clear; this model's 8-mers force an order-7 match. Detecting an order-(k-1)-matched
	adversary requires signal that is not in global composition at all: per-position, context-dependent
	modeling of the kind a neural sequence model provides. This boundary is a property of the method,
	and it applies equally to other composition-based detectors.