phanerozoic/dna-origin-classifier

@@ -18,7 +18,7 @@ python dna_filter.py reads.fasta    --mode classify     --report calls.tsv
 ## Footprint and throughput
-- **Model:** 1.12 MB safetensors (280,903 parameters). No reference database, no GPU.
 - **Dependencies:** `numpy` and `safetensors` only.
 - **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
   reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the multi-
@@ -30,7 +30,7 @@ Host vs non-host AUROC by read length:
 | length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
 |---|---|---|---|---|---|---|---|
-| AUROC | 0.941 | 0.980 | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 |
 It is near-perfect from 150 bp up (typical Illumina paired-end and assembled contigs) and still
 useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human reads
@@ -47,5 +47,5 @@ conservative retention for enrichment).
 - **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
   use it for host-vs-microbe, not for separating vertebrate species.
-The filter trades a database and an aligner for a 1 MB lookup that runs anywhere, at the cost of
 the per-base certainty an exact match gives when the organism is already in a reference.

 ## Footprint and throughput
+- **Model:** 2 MB safetensors (524,295 parameters, single k=8 head set). No reference database, no GPU.
 - **Dependencies:** `numpy` and `safetensors` only.
 - **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
   reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the multi-
 | length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
 |---|---|---|---|---|---|---|---|
+| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |
 It is near-perfect from 150 bp up (typical Illumina paired-end and assembled contigs) and still
 useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human reads
 - **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
   use it for host-vs-microbe, not for separating vertebrate species.
+The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
 the per-base certainty an exact match gives when the organism is already in a reference.
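
The size figures on both sides of this diff can be sanity-checked from the parameter counts. The sketch below assumes every parameter is stored as a 4-byte float32 and ignores the small safetensors header; the README does not state the dtype, so treat that as an assumption. The helper name `weights_size_bytes` is hypothetical, not part of the repository.

```python
# Assumption: weights are float32 (4 bytes each); file header overhead is ignored.
BYTES_PER_PARAM = 4


def weights_size_bytes(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Return the raw tensor payload size for a given parameter count."""
    return n_params * bytes_per_param


old_bytes = weights_size_bytes(280_903)  # parameter count on the removed line
new_bytes = weights_size_bytes(524_295)  # parameter count on the added line

print(f"old: {old_bytes / 1e6:.2f} MB")
print(f"new: {new_bytes / 1e6:.2f} MB ({new_bytes / 2**20:.2f} MiB)")
```

Under that assumption the old count gives about 1.12 MB and the new count gives almost exactly 2 MiB, which matches the change from "1.12 MB" to "2 MB" in the diff.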