Drop unreproducible design-distance multiple; correct undatabased non-host rate to 99%
caba6d4 verified | license: mit | |
| pipeline_tag: text-classification | |
| library_name: safetensors | |
| base_model: | |
| - HuggingFaceBio/Carbon-8B | |
| datasets: | |
| - phanerozoic/dna-origin-benchmark | |
| metrics: | |
| - roc_auc | |
| - accuracy | |
| tags: | |
| - safetensors | |
| - genomics | |
| - dna | |
| - k-mer | |
| - sequence-classification | |
| - reference-free | |
| - interpretable | |
| model-index: | |
| - name: dna-origin-classifier | |
| results: | |
| - task: | |
| type: text-classification | |
| name: Human vs non-host detection | |
| dataset: | |
| name: dna-origin-benchmark | |
| type: phanerozoic/dna-origin-benchmark | |
| split: test | |
| metrics: | |
| - type: roc_auc | |
| value: 0.993 | |
| name: AUROC (test) | |
| - type: roc_auc | |
| value: 0.990 | |
| name: AUROC (novel taxa) | |
| - task: | |
| type: text-classification | |
| name: Engineered vs natural detection | |
| dataset: | |
| name: dna-origin-benchmark | |
| type: phanerozoic/dna-origin-benchmark | |
| split: test | |
| metrics: | |
| - type: roc_auc | |
| value: 0.919 | |
| name: AUROC (test) | |
| - type: roc_auc | |
| value: 0.896 | |
| name: AUROC (novel taxa) | |
| - task: | |
| type: text-classification | |
| name: Five-class origin | |
| dataset: | |
| name: dna-origin-benchmark | |
| type: phanerozoic/dna-origin-benchmark | |
| split: test | |
| metrics: | |
| - type: accuracy | |
| value: 0.708 | |
| name: Accuracy (test, 5-class) | |
| # dna-origin-classifier | |
| A reference-free, alignment-free classifier that labels a DNA sequence by its source: human, other | |
| eukaryote, bacterial, viral, or engineered/synthetic. It loads from a 2 MB safetensors file, runs on a | |
| CPU at thousands of reads per second, consults no database, and because the score is an exact linear | |
| function of 8-mer counts it is also interpretable, invertible, and certifiable in closed form. | |
| The model is small because the thing it detects is small. Sequence identity, the answer to "what is | |
| this", compresses to a few hundred thousand linear weights, and a 2 MB model recovers what an 8B | |
| genomic language model knows about origin. Sequence function, the answer to "what does this do", does | |
| not compress the same way: the same compact form trained on variant effect reaches far below the large | |
| model. That asymmetry runs through this card. It is computable in advance from the data alone, it is | |
| machine-checked over the reals, and run across the benchmarks that genomic and protein foundation | |
| models are measured on, it separates the tasks those models are credited with into the ones a linear | |
| baseline already solves and the ones where the large model actually earns its scale. The classifier is | |
| the anchor; the boundary is the result. | |
| ## Method | |
| The featurizer counts all 65,536 8-mers in a sequence, normalizes them to within-sequence frequency, | |
| and divides by a stored per-feature scale. Three discriminatively trained linear heads read that vector: | |
| - `origin`: five-class, human / eukaryote / bacteria / virus / engineered. | |
| - `host`: binary, human versus non-host (bacterial or viral). | |
| - `engineered`: binary, engineered or synthetic versus natural. | |
| The discriminative fit is what separates this from the classical generative k-mer log-odds: on the same | |
| features it raises five-class accuracy from 0.63 to 0.71. Every parameter lives in `model.safetensors` | |
| (`feature_scale`, `origin.weight/bias`, `host.weight/bias`, `engineered.weight/bias`), 524,295 numbers | |
| in total. There is nothing else: no embedding table, no nonlinearity, no reference index. | |
| ## Usage | |
| ```python | |
| from model import DnaOriginClassifier | |
| clf = DnaOriginClassifier("model.safetensors") | |
| seq = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT" | |
| clf.classify(seq) # 'human' | 'eukaryote' | 'bacteria' | 'virus' | 'engineered' | |
| clf.host_score(seq) # higher = more human/host-like | |
| clf.engineered_score(seq) # higher = more likely engineered/synthetic | |
| clf.attribute(seq) # exact per-base contribution to the host score (sums to the score) | |
| clf.certify(seq) # base substitutions that flip the host call | |
| ``` | |
| The repository ships the model with four small tools, each requiring only `numpy` and `safetensors`: | |
| - `design.py` generates sequences that maximize or minimize a head, either from scratch or by | |
| synonymous codon choice with the encoded protein held fixed. | |
| - `dna_filter.py` is a FASTQ/FASTA read filter built on the host head, in two modes: host depletion for | |
| pathogen enrichment, and human removal for privacy. | |
| - `boundary.py` computes the composition-solvability distance D for two classes of sequences, or for | |
| many, with no training. It is the criterion behind the boundary sections below. | |
| - `Detector.v` and `CompositionBoundary.v` are Rocq proofs, described where they are used. | |
| ## Evaluation | |
| On the test and novel-taxa splits of | |
| [dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark) (24,950 | |
| fragments, 382 organisms, cluster-level splits so that train and test share no near-duplicate cluster): | |
| | task | head | test | novel taxa | | |
| |---|---|---|---| | |
| | human vs non-host | `host` | 0.993 | 0.990 | | |
| | engineered vs natural | `engineered` | 0.919 | 0.896 | | |
| Five-class origin accuracy is 0.708 against a random baseline of 0.20. The novel-taxa column holds out | |
| entire clades, so the small drop from test to novel measures generalization to organisms never seen in | |
| training rather than memorization of databased sequence. | |
| ## How it compares | |
| On the same splits and tasks: | |
| - **Database tools.** Kraken2 against RefSeq matches the model on databased taxa (host sensitivity | |
| 0.972, specificity 1.000). On sequence absent from every database, Kraken2 classifies 0% of reads and | |
| BLAST 6.6%, while this model calls 100%, because it needs no reference to consult. | |
| - **Learned reference-free models.** A fine-tuned 110M genomic language model (DNABERT) matches it on | |
| host detection (0.989 against 0.990 on novel clades); a CNN and a BiLSTM trained on the same data land | |
| below at 0.94 and 0.97. The linear model reaches that level at roughly 200 times fewer parameters, on | |
| a CPU. | |
| ## What the linear form gives you | |
| Because the score is an exact linear function of 8-mer counts, the model supports operations a black-box | |
| classifier cannot do cleanly: | |
| - **Certified robustness.** `certify` returns, by greedy search with exact per-edit deltas, a count of | |
| base substitutions that flips a call, an upper bound on the true adversarial distance: a typical human | |
| read takes a median of 3 of 300 to read as non-host, an engineered sequence 4 to pass as natural. The | |
| matching lower bound is proved from the linear form and machine-checked in `Detector.v` (Rocq, over | |
| the rationals, no axioms and no `admit`): one substitution moves the score by at most 2k times the | |
| largest effective weight, so no edit within the margin divided by that constant can flip the call. | |
| - **Exact attribution.** `attribute` decomposes any call into per-base contributions that sum exactly to | |
| the score, with no gradient approximation. The same decomposition is the attribution-exactness identity | |
| in the proof file. | |
| - **Inverse design.** Coordinate ascent on a head designs sequences to order. From scratch it reaches | |
| host_score 22, far past the natural human ceiling of about 6. Holding a protein fixed, synonymous codon | |
| choice alone moves the same gene from host_score −15 to +14. The two heads give two synonymous channels | |
| at once, so a single gene stamps into any of the four (host, engineered) sign-quadrants with the protein | |
| unchanged and the detector reads the two bits back: a protein-preserving provenance watermark. | |
| - **Weight arithmetic.** Detectors compose in weight space. A host detector built purely by algebra on | |
| the origin rows, `human − ½(bacteria + virus)`, never trained as such, scores 0.992 against the trained | |
| head's 0.993. | |
| ## The 8-mer atlas | |
| The model's learned weights are published as a standalone dataset, | |
| [dna-origin-atlas](https://huggingface.co/datasets/phanerozoic/dna-origin-atlas): all 65,536 8-mer | |
| weights for each head, annotated with GC and CpG content. The dominant axis is biological. CpG-bearing | |
| 8-mers carry a mean host weight of −6.47 against +0.27 for non-CpG 8-mers, so the table encodes the | |
| vertebrate CpG-depletion (methylation) signature as a number. The discrimination rests on bulk | |
| composition of this kind rather than on a lexicon of named regulatory motifs. | |
| ## The partition: identity is shallow, function is deep | |
| One asymmetry runs through everything here. Sequence identity is shallow: it compresses to this 2 MB | |
| linear model, separable and invertible, reaching 0.99 AUROC on human versus non-host, and the learned | |
| origin space separates bacteria, eukaryotes, and viruses without a database. Sequence function is deep: | |
| a compact model trained on labeled variants reaches only 0.56 on variant effect, far below the 0.85 of | |
| the 8B language model, so the function signal does not compress from supervision the way identity does. | |
|  | |
| *Origin-head weights alone separate bacteria, eukaryotes, and viruses in a PCA of the model's learned | |
| space; the single human point sits far from all three. No reference database is consulted.* | |
| The split is not specific to DNA. Building the same composition classifier on other molecules: | |
| | molecule | identity (composition) | function (composition) | | |
| |---|---|---| | |
| | DNA | 0.99 | 0.56 | | |
| | protein | 0.91 | 0.73 | | |
| | RNA | 1.00 | n/a | | |
| Identity compresses to composition across alphabets; function does not. The protein function entry reads | |
| 0.73 rather than chance because the amino-acid substitution itself carries chemical signal, but its | |
| information fraction (0.14) stays below the boundary, and the context-dependent part is absent. The | |
| boundary is predictable in information terms: a task is composition-solvable when the mutual information | |
| between the k-mer spectrum and its label exceeds about 0.2 of the label entropy. Across ten tasks spanning | |
| DNA and protein, a single threshold near 0.20 separates every identity task from every function task, and | |
| the two cases whose side is not clear in advance land where it predicts. Human against other eukaryotes, | |
| the hardest identity task at 0.26, stays above; missense pathogenicity, the strongest function signal at | |
| 0.14, stays below; pathogenicity read from the surrounding composition alone sits at chance. | |
|  | |
| *Composition AUROC against the captured information fraction. Identity tasks (blue) lie above the | |
| threshold, function tasks (red) below, with DNA and protein on the same axis.* | |
| Generation makes the separation concrete. Coordinate ascent on the host head designs sequences with | |
| host_score up to 22, against a natural ceiling of about 6, and these designs sit far outside the | |
| region of composition space occupied by any natural class, where the detector saturates but no | |
| biology lives. The 8B language model rates those same designs as | |
| increasingly unnatural, so composition and neural naturalness are separable axes rather than two views of | |
| one thing. `ADVERSARIAL.md` covers the order-(k-1) evasion boundary and why a language model resists it; | |
| `TOOL.md` covers footprint, throughput, read length, and operating modes. | |
|  | |
| *Natural sequences of every origin class occupy one connected region of composition space (center); | |
| coordinate-ascent designs (stars) land far outside it, where the host head saturates but no biology lives.* | |
| ## The composition boundary | |
| The partition has a boundary that can be computed before any training and a proof that it holds. View a | |
| length-L sequence as N = L - k + 1 draws from a class-conditional k-mer distribution, which is all a | |
| composition classifier ever sees. The Bhattacharyya coefficient between the two classes' count | |
| distributions then factorizes exactly to rho^N, with rho the per-k-mer overlap sum of | |
| sqrt(theta_0 theta_1); the Bayes error of any count-based classifier is at most half of rho^N; and for a | |
| balanced label the mutual information equals the Jensen-Shannon divergence between the count | |
| distributions. Solvability is therefore set by D = -N ln rho, read off the class distributions with no | |
| classifier. Equal distributions give rho = 1 and chance; rho below 1 gives an error that vanishes with | |
| length. These four statements are machine-checked in `CompositionBoundary.v` (Rocq, over Coq's standard | |
| reals, no `admit`). | |
| The first statement of the boundary treats overlapping k-mers as independent, which overstates | |
| separability where the sequence is correlated. The proof file also carries the correlated generalization: | |
| model the sequence as an order-1 Markov chain over states, with amp the initial Bhattacharyya amplitude | |
| and a transfer matrix whose entries are per-transition Bhattacharyya weights, and the path coefficient is | |
| the transfer-matrix evaluation, whose per-step growth is the top eigenvalue of that matrix rather than the | |
| scalar rho. The generalization reduces to rho^N exactly when the chain has no memory, which is the iid | |
| case, and that reduction is the machine-checked theorem. Recomputed at order 5, the distance tracks | |
| measured performance more closely (rank correlation with AUROC 0.92 against 0.87 for the iid form) and the | |
| worst overstatement, the enhancer tasks, falls from a distance of 9.0 to 2.3, in line with their 0.79 | |
| AUROC. | |
|  | |
| *Left, the iid distance against measured AUROC; right, the order-5 Markov distance. The correlated form | |
| pulls the over-separated tasks back toward the line and raises the rank correlation.* | |
| ## A foundation model against the linear baseline | |
| Run the criterion across the sixteen binary Nucleotide Transformer tasks, the genomic benchmark these | |
| models are sold on, and D predicts task by task whether composition suffices. A fine-tuned 110M DNABERT | |
| bears it out. On the promoters, where D is high and the 2 MB linear model already reaches 0.96, DNABERT | |
| adds only 0.02 to 0.04: a near tie, with little room for any model. The splice donor and acceptor sites | |
| are the opposite extreme: at D near 1 the linear model reaches about 0.75 while DNABERT reaches 0.99, a | |
| gain near a quarter of an AUROC, because the motif that separates the classes is positional and sits | |
| beneath the composition a bag of k-mers reads. The histone marks fall in between, with gains of 0.05 to | |
| 0.10 at intermediate D. Averaged over the benchmark, DNABERT's gain over the linear model is +0.097 below | |
| the boundary against +0.054 above it, and a full-data fine-tune on the decisive tasks (splice at 0.99, | |
| promoter at 0.99) reproduces the split. The foundation model's distinct capability concentrates where the | |
| criterion places it, scaling with how far below the boundary the task sits. | |
|  | |
| *Left, D against test AUROC for the linear model and DNABERT across the sixteen tasks; right, the | |
| DNABERT-minus-linear gain against D, concentrated below the threshold, splice sites foremost.* | |
| The same line holds across suites and molecules. D and the linear baseline reproduce the partition over the | |
| Genomic Benchmarks and the available GUE tasks as well as over the Nucleotide Transformer set, about thirty | |
| tasks in three suites. Real foundation models confirm the cross-molecule reading: amino-acid composition | |
| separates eukaryote from bacterium above the boundary (D 4.1, where an ESM probe adds 0.05) but cannot | |
| resolve subcellular localization below it (D_min 0.4, where ESM adds 0.26); for RNA, composition and an | |
| RNA-FM probe both clear the human-versus-yeast non-coding identity task above the boundary (D 15, RNA-FM | |
| ahead by 0.03). In every molecule the learned model adds little above the line and most of its value below | |
| it. | |
|  | |
| *Composition AUROC against D for DNA tasks from three suites plus the protein and RNA identity points, all | |
| on one axis.* | |
| A forward test fixes the direction. On a three-class splice dataset from a different source than any used to | |
| set the threshold, D was computed first and predicted that composition cannot separate any pair of classes | |
| (D between 0.13 and 0.22); the linear baseline then scored 0.53 to 0.66, confirming all three calls before a | |
| model was run. Splice sounds tractable, and the criterion calls its failure for composition in advance. | |
| The consequence is a re-scoring rather than a new tool. A sequence-classification benchmark can be ordered by | |
| D with no training; its above-boundary entries are k-mer baselines that a 2 MB model matches, and a | |
| foundation model's distinct contribution is whatever it adds on the entries below. Variant and fitness | |
| assays such as ProteinGym are mutant ensembles rather than composition samples, so they sit outside this | |
| model and in the deep regime by construction. The full GUE set, BEND, and an RNA function benchmark need | |
| data not mirrored on HuggingFace, so RNA enters here on the identity side. | |
| ## Screening and hybrid use | |
| - **Certified screening frontier.** To pass an engineered construct as natural, an adversary must make a | |
| median of 4 base substitutions (46% need 5 or more) or reproduce the full order-7 composition of a | |
| natural class. Detection holds for 60% of constructs after the best 3-edit attack and for none after 9 | |
| edits. Composition screening is therefore sound against light point edits and blind to a construct that | |
| reproduces natural composition, and the boundary states precisely where that blindness begins. | |
| - **Strictly dominant hybrid.** Composing this detector with a database method, taking the database call | |
| where it classifies and the composition call otherwise, reaches 0.994 correct on a mixed databased and | |
| undatabased set, above Kraken2 alone (0.337, blind to undatabased sequence) and this model alone (0.980), | |
| and by construction never worse than either component. | |
| ## Calibration | |
| The heads output linear margins, not calibrated probabilities. Use the argmax of `logits` for origin and | |
| the ranking or sign of `host_score` and `engineered_score` for the binary tasks. | |
| ## Training data | |
| All heads are fit on the train split of `dna-origin-benchmark`: RefSeq coding sequences across roughly 370 | |
| taxa, NCBI synthetic-construct and vector records, synonymous recodings, and environmental metagenome | |
| fragments. Splits are drawn at the cluster level so that evaluation never sees a near-duplicate of a training | |
| sequence, and a novel-taxa split holds out whole clades. | |
| ## Files in this repository | |
| - `model.safetensors`, `model.py`: the classifier and its loader. | |
| - `boundary.py`: the composition-solvability distance D, for two classes or for many, no training. | |
| - `Detector.v`: machine-checked certified robustness and exact attribution for the linear detector. | |
| - `CompositionBoundary.v`: the boundary theorem (the factorization, the Bayes-error bound, the | |
| Jensen-Shannon identity, the dichotomy) and its Markov-correlated generalization. | |
| - `design.py`, `dna_filter.py`: inverse design and the read filter. | |
| - `ADVERSARIAL.md`, `TOOL.md`: the evasion analysis and the deployment characteristics. | |
| - Figures: `tree_of_life.png`, `partition_law.png`, `analytical_crossover.png`, `manifold.png`, | |
| `composition_boundary_fm.png`, `cross_suite.png`, `markov_d.png`. | |
| ## References | |
| - **Base model (derived by analysis):** [HuggingFaceBio/Carbon-8B](https://huggingface.co/HuggingFaceBio/Carbon-8B). | |
| This classifier was derived from Carbon-8B by identifying the closed-form k-mer statistic that reproduces | |
| its host and identity discrimination. The derivation is analytical: it holds no Carbon weights or outputs | |
| and is fit to public NCBI sequence, hence MIT. | |
| - **Benchmark and baselines:** [phanerozoic/dna-origin-benchmark](https://huggingface.co/datasets/phanerozoic/dna-origin-benchmark) | |
| and [phanerozoic/dna-origin-atlas](https://huggingface.co/datasets/phanerozoic/dna-origin-atlas). | |
| - **Foundation-model benchmarks compared against:** the Nucleotide Transformer downstream tasks, the | |
| Genomic Benchmarks, GUE, with DNABERT, ESM-2, and RNA-FM as the learned models. | |
| - **Background:** k-mer composition classifiers for sequence origin (RDP Classifier, Wang et al. 2007; | |
| genomic signatures, Karlin and Burge 1995). | |
| ## License | |
| MIT. | |