Drop unreproducible design-distance multiple; correct undatabased non-host rate to 99%

caba6d4 verified 3 days ago

20.3 kB

	---
	license: mit
	pipeline_tag: text-classification
	library_name: safetensors
	base_model:
	- HuggingFaceBio/Carbon-8B
	datasets:
	- phanerozoic/dna-origin-benchmark
	metrics:
	- roc_auc
	- accuracy
	tags:
	- safetensors
	- genomics
	- dna
	- k-mer
	- sequence-classification
	- reference-free
	- interpretable
	model-index:
	- name: dna-origin-classifier
	results:
	- task:
	type: text-classification
	name: Human vs non-host detection
	dataset:
	name: dna-origin-benchmark
	type: phanerozoic/dna-origin-benchmark
	split: test
	metrics:
	- type: roc_auc
	value: 0.993
	name: AUROC (test)
	- type: roc_auc
	value: 0.990
	name: AUROC (novel taxa)
	- task:
	type: text-classification
	name: Engineered vs natural detection
	dataset:
	name: dna-origin-benchmark
	type: phanerozoic/dna-origin-benchmark
	split: test
	metrics:
	- type: roc_auc
	value: 0.919
	name: AUROC (test)
	- type: roc_auc
	value: 0.896
	name: AUROC (novel taxa)
	- task:
	type: text-classification
	name: Five-class origin
	dataset:
	name: dna-origin-benchmark
	type: phanerozoic/dna-origin-benchmark
	split: test
	metrics:
	- type: accuracy
	value: 0.708
	name: Accuracy (test, 5-class)
	---

	# dna-origin-classifier

	A reference-free, alignment-free classifier that labels a DNA sequence by its source: human, other
	eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file, runs on a
	CPU at thousands of reads per second, consults no database, and because the score is an exact linear
	function of 8-mer counts it is also interpretable, invertible, and certifiable in closed form.

	The model is small because the thing it detects is small. Sequence identity, the answer to "what is
	this", compresses to a few hundred thousand linear weights, and a 2 MB model recovers what an 8B
	genomic language model knows about origin. Sequence function, the answer to "what does this do", does
	not compress the same way: the same compact form trained on variant effect reaches far below the large
	model. That asymmetry runs through this card. It is computable in advance from the data alone, it is
	machine-checked over the reals, and run across the benchmarks that genomic and protein foundation
	models are measured on, it separates the tasks those models are credited with into the ones a linear
	baseline already solves and the ones where the large model actually earns its scale. The classifier is
	the anchor; the boundary is the result.

	## Method

	The featurizer counts all 65,536 8-mers in a sequence, normalizes them to within-sequence frequency,
	and divides by a stored per-feature scale. Three discriminatively trained linear heads read that vector:

	- `origin`: five-class, human / eukaryote / bacteria / virus / engineered.
	- `host`: binary, human versus non-host (bacterial or viral).
	- `engineered`: binary, engineered or synthetic versus natural.

	The discriminative fit is what separates this from the classical generative k-mer log-odds: on the same
	features it raises five-class accuracy from 0.63 to 0.71. Every parameter lives in `model.safetensors`
	(`feature_scale`, `origin.weight/bias`, `host.weight/bias`, `engineered.weight/bias`), 524,295 numbers
	in total. There is nothing else: no embedding table, no nonlinearity, no reference index.

	## Usage

	```python
	from model import DnaOriginClassifier
	clf = DnaOriginClassifier("model.safetensors")

	seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"
	clf.classify(seq) # 'human' \| 'eukaryote' \| 'bacteria' \| 'virus' \| 'engineered'
	clf.host_score(seq) # higher = more human/host-like
	clf.engineered_score(seq) # higher = more likely engineered/synthetic

	clf.attribute(seq) # exact per-base contribution to the host score (sums to the score)
	clf.certify(seq) # base substitutions that flip the host call
	```

	The repository ships the model with four small tools, each requiring only `numpy` and `safetensors`:

	- `design.py` generates sequences that maximize or minimize a head, either from scratch or by
	synonymous codon choice with the encoded protein held fixed.
	- `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host depletion for
	pathogen enrichment, and human removal for privacy.
	- `boundary.py` computes the composition-solvability distance D for two classes of sequences, or for
	many, with no training. It is the criterion behind the boundary sections below.
	- `Detector.v` and `CompositionBoundary.v` are Rocq proofs, described where they are used.

	## Evaluation

	On the test and novel-taxa splits of
	[dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark) (24,950
	fragments, 382 organisms, cluster-level splits so that train and test share no near-duplicate cluster):

	\| task \| head \| test \| novel taxa \|
	\|---\|---\|---\|---\|
	\| human vs non-host \| `host` \| 0.993 \| 0.990 \|
	\| engineered vs natural \| `engineered` \| 0.919 \| 0.896 \|

	Five-class origin accuracy is 0.708 against a random baseline of 0.20. The novel-taxa column holds out
	entire clades, so the small drop from test to novel measures generalization to organisms never seen in
	training rather than memorization of databased sequence.

	## How it compares

	On the same splits and tasks:

	- Database tools. Kraken2 against RefSeq matches the model on databased taxa (host sensitivity
	0.972, specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% of reads and
	BLAST 6.6%, while this model calls 100%, because it needs no reference to consult.
	- Learned reference-free models. A fine-tuned 110M genomic language model (DNABERT) matches it on
	host detection (0.989 against 0.990 on novel clades); a CNN and a BiLSTM trained on the same data land
	below at 0.94 and 0.97. The linear model reaches that level at roughly 200 times fewer parameters, on
	a CPU.

	## What the linear form gives you

	Because the score is an exact linear function of 8-mer counts, the model supports operations a black-box
	classifier cannot do cleanly:

	- Certified robustness. `certify` returns, by greedy search with exact per-edit deltas, a count of
	base substitutions that flips a call, an upper bound on the true adversarial distance: a typical human
	read takes a median of 3 of 300 to read as non-host, an engineered sequence 4 to pass as natural. The
	matching lower bound is proved from the linear form and machine-checked in `Detector.v` (Rocq, over
	the rationals, no axioms and no `admit`): one substitution moves the score by at most 2k times the
	largest effective weight, so no edit within the margin divided by that constant can flip the call.
	- Exact attribution. `attribute` decomposes any call into per-base contributions that sum exactly to
	the score, with no gradient approximation. The same decomposition is the attribution-exactness identity
	in the proof file.
	- Inverse design. Coordinate ascent on a head designs sequences to order. From scratch it reaches
	host_score 22, far past the natural human ceiling of about 6. Holding a protein fixed, synonymous codon
	choice alone moves the same gene from host_score −15 to +14. The two heads give two synonymous channels
	at once, so a single gene stamps into any of the four (host, engineered) sign-quadrants with the protein
	unchanged and the detector reads the two bits back: a protein-preserving provenance watermark.
	- Weight arithmetic. Detectors compose in weight space. A host detector built purely by algebra on
	the origin rows, `human − ½(bacteria + virus)`, never trained as such, scores 0.992 against the trained
	head's 0.993.

	## The 8-mer atlas

	The model's learned weights are published as a standalone dataset,
	[dna-origin-atlas](https://huggingface.co/datasets/phanerozoic/dna-origin-atlas): all 65,536 8-mer
	weights for each head, annotated with GC and CpG content. The dominant axis is biological. CpG-bearing
	8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the table encodes the
	vertebrate CpG-depletion (methylation) signature as a number. The discrimination rests on bulk
	composition of this kind rather than on a lexicon of named regulatory motifs.

	## The partition: identity is shallow, function is deep

	One asymmetry runs through everything here. Sequence identity is shallow: it compresses to this 2 MB
	linear model, separable and invertible, reaching 0.99 AUROC on human versus non-host, and the learned
	origin space separates bacteria, eukaryotes, and viruses without a database. Sequence function is deep:
	a compact model trained on labeled variants reaches only 0.56 on variant effect, far below the 0.85 of
	the 8B language model, so the function signal does not compress from supervision the way identity does.

	![Organisms in the learned origin space](tree_of_life.png)

	*Origin-head weights alone separate bacteria, eukaryotes, and viruses in a PCA of the model's learned
	space; the single human point sits far from all three. No reference database is consulted.*

	The split is not specific to DNA. Building the same composition classifier on other molecules:

	\| molecule \| identity (composition) \| function (composition) \|
	\|---\|---\|---\|
	\| DNA \| 0.99 \| 0.56 \|
	\| protein \| 0.91 \| 0.73 \|
	\| RNA \| 1.00 \| n/a \|

	Identity compresses to composition across alphabets; function does not. The protein function entry reads
	0.73 rather than chance because the amino-acid substitution itself carries chemical signal, but its
	information fraction (0.14) stays below the boundary, and the context-dependent part is absent. The
	boundary is predictable in information terms: a task is composition-solvable when the mutual information
	between the k-mer spectrum and its label exceeds about 0.2 of the label entropy. Across ten tasks spanning
	DNA and protein, a single threshold near 0.20 separates every identity task from every function task, and
	the two cases whose side is not clear in advance land where it predicts. Human against other eukaryotes,
	the hardest identity task at 0.26, stays above; missense pathogenicity, the strongest function signal at
	0.14, stays below; pathogenicity read from the surrounding composition alone sits at chance.

	![The partition law across ten tasks](partition_law.png)

	*Composition AUROC against the captured information fraction. Identity tasks (blue) lie above the
	threshold, function tasks (red) below, with DNA and protein on the same axis.*

	Generation makes the separation concrete. Coordinate ascent on the host head designs sequences with
	host_score up to 22, against a natural ceiling of about 6, and these designs sit far outside the
	region of composition space occupied by any natural class, where the detector saturates but no
	biology lives. The 8B language model rates those same designs as
	increasingly unnatural, so composition and neural naturalness are separable axes rather than two views of
	one thing. `ADVERSARIAL.md` covers the order-(k-1) evasion boundary and why a language model resists it;
	`TOOL.md` covers footprint, throughput, read length, and operating modes.

	![The composition manifold](manifold.png)

	*Natural sequences of every origin class occupy one connected region of composition space (center);
	coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology lives.*

	## The composition boundary

	The partition has a boundary that can be computed before any training and a proof that it holds. View a
	length-L sequence as N = L - k + 1 draws from a class-conditional k-mer distribution, which is all a
	composition classifier ever sees. The Bhattacharyya coefficient between the two classes' count
	distributions then factorizes exactly to rho^N, with rho the per-k-mer overlap sum of
	sqrt(theta_0 theta_1); the Bayes error of any count-based classifier is at most half of rho^N; and for a
	balanced label the mutual information equals the Jensen-Shannon divergence between the count
	distributions. Solvability is therefore set by D = -N ln rho, read off the class distributions with no
	classifier. Equal distributions give rho = 1 and chance; rho below 1 gives an error that vanishes with
	length. These four statements are machine-checked in `CompositionBoundary.v` (Rocq, over Coq's standard
	reals, no `admit`).

	The first statement of the boundary treats overlapping k-mers as independent, which overstates
	separability where the sequence is correlated. The proof file also carries the correlated generalization:
	model the sequence as an order-1 Markov chain over states, with amp the initial Bhattacharyya amplitude
	and a transfer matrix whose entries are per-transition Bhattacharyya weights, and the path coefficient is
	the transfer-matrix evaluation, whose per-step growth is the top eigenvalue of that matrix rather than the
	scalar rho. The generalization reduces to rho^N exactly when the chain has no memory, which is the iid
	case, and that reduction is the machine-checked theorem. Recomputed at order 5, the distance tracks
	measured performance more closely (rank correlation with AUROC 0.92 against 0.87 for the iid form) and the
	worst overstatement, the enhancer tasks, falls from a distance of 9.0 to 2.3, in line with their 0.79
	AUROC.

	![Accounting for k-mer correlation tightens the criterion](markov_d.png)

	*Left, the iid distance against measured AUROC; right, the order-5 Markov distance. The correlated form
	pulls the over-separated tasks back toward the line and raises the rank correlation.*

	## A foundation model against the linear baseline

	Run the criterion across the sixteen binary Nucleotide Transformer tasks, the genomic benchmark these
	models are sold on, and D predicts task by task whether composition suffices. A fine-tuned 110M DNABERT
	bears it out. On the promoters, where D is high and the 2 MB linear model already reaches 0.96, DNABERT
	adds only 0.02 to 0.04: a near tie, with little room for any model. The splice donor and acceptor sites
	are the opposite extreme: at D near 1 the linear model reaches about 0.75 while DNABERT reaches 0.99, a
	gain near a quarter of an AUROC, because the motif that separates the classes is positional and sits
	beneath the composition a bag of k-mers reads. The histone marks fall in between, with gains of 0.05 to
	0.10 at intermediate D. Averaged over the benchmark, DNABERT's gain over the linear model is +0.097 below
	the boundary against +0.054 above it, and a full-data fine-tune on the decisive tasks (splice at 0.99,
	promoter at 0.99) reproduces the split. The foundation model's distinct capability concentrates where the
	criterion places it, scaling with how far below the boundary the task sits.

	![A foundation model against the 2 MB linear model](composition_boundary_fm.png)

	*Left, D against test AUROC for the linear model and DNABERT across the sixteen tasks; right, the
	DNABERT-minus-linear gain against D, concentrated below the threshold, splice sites foremost.*

	The same line holds across suites and molecules. D and the linear baseline reproduce the partition over the
	Genomic Benchmarks and the available GUE tasks as well as over the Nucleotide Transformer set, about thirty
	tasks in three suites. Real foundation models confirm the cross-molecule reading: amino-acid composition
	separates eukaryote from bacterium above the boundary (D 4.1, where an ESM probe adds 0.05) but cannot
	resolve subcellular localization below it (D_min 0.4, where ESM adds 0.26); for RNA, composition and an
	RNA-FM probe both clear the human-versus-yeast non-coding identity task above the boundary (D 15, RNA-FM
	ahead by 0.03). In every molecule the learned model adds little above the line and most of its value below
	it.

	![D predicts composition-solvability across suites and molecules](cross_suite.png)

	*Composition AUROC against D for DNA tasks from three suites plus the protein and RNA identity points, all
	on one axis.*

	A forward test fixes the direction. On a three-class splice dataset from a different source than any used to
	set the threshold, D was computed first and predicted that composition cannot separate any pair of classes
	(D between 0.13 and 0.22); the linear baseline then scored 0.53 to 0.66, confirming all three calls before a
	model was run. Splice sounds tractable, and the criterion calls its failure for composition in advance.

	The consequence is a re-scoring rather than a new tool. A sequence-classification benchmark can be ordered by
	D with no training; its above-boundary entries are k-mer baselines that a 2 MB model matches, and a
	foundation model's distinct contribution is whatever it adds on the entries below. Variant and fitness
	assays such as ProteinGym are mutant ensembles rather than composition samples, so they sit outside this
	model and in the deep regime by construction. The full GUE set, BEND, and an RNA function benchmark need
	data not mirrored on HuggingFace, so RNA enters here on the identity side.

	## Screening and hybrid use

	- Certified screening frontier. To pass an engineered construct as natural, an adversary must make a
	median of 4 base substitutions (46% need 5 or more) or reproduce the full order-7 composition of a
	natural class. Detection holds for 60% of constructs after the best 3-edit attack and for none after 9
	edits. Composition screening is therefore sound against light point edits and blind to a construct that
	reproduces natural composition, and the boundary states precisely where that blindness begins.
	- Strictly dominant hybrid. Composing this detector with a database method, taking the database call
	where it classifies and the composition call otherwise, reaches 0.994 correct on a mixed databased and
	undatabased set, above Kraken2 alone (0.337, blind to undatabased sequence) and this model alone (0.980),
	and by construction never worse than either component.

	## Calibration

	The heads output linear margins, not calibrated probabilities. Use the argmax of `logits` for origin and
	the ranking or sign of `host_score` and `engineered_score` for the binary tasks.

	## Training data

	All heads are fit on the train split of `dna-origin-benchmark`: RefSeq coding sequences across roughly 370
	taxa, NCBI synthetic-construct and vector records, synonymous recodings, and environmental metagenome
	fragments. Splits are drawn at the cluster level so that evaluation never sees a near-duplicate of a training
	sequence, and a novel-taxa split holds out whole clades.

	## Files in this repository

	- `model.safetensors`, `model.py`: the classifier and its loader.
	- `boundary.py`: the composition-solvability distance D, for two classes or for many, no training.
	- `Detector.v`: machine-checked certified robustness and exact attribution for the linear detector.
	- `CompositionBoundary.v`: the boundary theorem (the factorization, the Bayes-error bound, the
	Jensen-Shannon identity, the dichotomy) and its Markov-correlated generalization.
	- `design.py`, `dna_filter.py`: inverse design and the read filter.
	- `ADVERSARIAL.md`, `TOOL.md`: the evasion analysis and the deployment characteristics.
	- Figures: `tree_of_life.png`, `partition_law.png`, `analytical_crossover.png`, `manifold.png`,
	`composition_boundary_fm.png`, `cross_suite.png`, `markov_d.png`.

	## References

	- Base model (derived by analysis): [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B).
	This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces
	its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs
	and is fit to public NCBI sequence, hence MIT.
	- Benchmark and baselines: [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark)
	and [phanerozoic/dna-origin-atlas](https://huggingface.co/datasets/phanerozoic/dna-origin-atlas).
	- Foundation-model benchmarks compared against: the Nucleotide Transformer downstream tasks, the
	Genomic Benchmarks, GUE, with DNABERT, ESM-2, and RNA-FM as the learned models.
	- Background: k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007;
	genomic signatures, Karlin and Burge 1995).

	## License

	MIT.