Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7
TOOL.md
CHANGED
@@ -18,7 +18,7 @@ python dna_filter.py reads.fasta --mode classify --report calls.tsv

 ## Footprint and throughput

-- **Model:**
 - **Dependencies:** `numpy` and `safetensors` only.
 - **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
 reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the multi-
@@ -30,7 +30,7 @@ Host vs non-host AUROC by read length:

 | length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
 |---|---|---|---|---|---|---|---|
-| AUROC | 0.

 It is near-perfect from 150 bp up (typical Illumina paired-end and assembled contigs) and still
 useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human reads
@@ -47,5 +47,5 @@ conservative retention for enrichment).
 - **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
 use it for host-vs-microbe, not for separating vertebrate species.

-The filter trades a database and an aligner for a
 the per-base certainty an exact match gives when the organism is already in a reference.
@@ -18,7 +18,7 @@ python dna_filter.py reads.fasta --mode classify --report calls.tsv

 ## Footprint and throughput

+- **Model:** 2 MB safetensors (524,295 parameters, single k=8 head set). No reference database, no GPU.
 - **Dependencies:** `numpy` and `safetensors` only.
 - **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
 reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the multi-
@@ -30,7 +30,7 @@ Host vs non-host AUROC by read length:

 | length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
 |---|---|---|---|---|---|---|---|
+| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |

 It is near-perfect from 150 bp up (typical Illumina paired-end and assembled contigs) and still
 useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human reads
@@ -47,5 +47,5 @@ conservative retention for enrichment).
 - **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
 use it for host-vs-microbe, not for separating vertebrate species.

+The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
+the per-base certainty an exact match gives when the organism is already in a reference.
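
The footprint and throughput figures quoted in the diff can be sanity-checked with a little arithmetic. The sketch below is not part of the tool. It assumes the weights are stored as 32-bit floats (the diff gives only the parameter count and the rounded file size) and that throughput scales linearly with worker count, which is an idealisation of "embarrassingly parallel". The function names and the 10-million-read workload are hypothetical.

```python
# Back-of-envelope check of the numbers quoted in TOOL.md.
N_PARAMS = 524_295        # parameter count stated in the diff
BYTES_PER_PARAM = 4       # ASSUMPTION: float32 weights
READS_PER_SECOND = 5_600  # stated single-thread rate at 300 bp


def model_size_bytes(n_params=N_PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Raw tensor payload in bytes, ignoring the small safetensors header."""
    return n_params * bytes_per_param


def wall_time_seconds(n_reads, workers=1, reads_per_second=READS_PER_SECOND):
    """Idealised run time. ASSUMPTION: perfect linear scaling across workers."""
    return n_reads / (reads_per_second * workers)


size = model_size_bytes()
print(f"{size:,} bytes ({size / 2**20:.2f} MiB)")  # 2,097,180 bytes (2.00 MiB)

n_reads = 10_000_000  # hypothetical workload
for workers in (1, 8):
    minutes = wall_time_seconds(n_reads, workers) / 60
    print(f"{workers} worker(s): {minutes:.1f} min")
```

Under the float32 assumption the payload comes to almost exactly 2 MiB, which agrees with the "2 MB" figure. Real multi-core runs will usually fall somewhat short of the linear estimate because of I/O and process start-up.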
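
For readers who want to reproduce an AUROC-by-length row like the one in the second hunk, a minimal NumPy-only sketch follows. It is not the project's evaluation code. It assumes you already have one score per read and a binary host/non-host label, and it computes AUROC through the rank-sum (Mann-Whitney U) identity with average ranks for ties. The function name `auroc` and the toy inputs are hypothetical.

```python
import numpy as np


def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic; tied scores share an average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # np.unique returns sorted distinct values; `inverse` maps each score to
    # its group and `counts` gives the group sizes.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)          # highest 1-based rank in each tie group
    first_rank = last_rank - counts + 1    # lowest 1-based rank in each tie group
    ranks = ((first_rank + last_rank) / 2.0)[inverse]

    # U statistic for the positive class, normalised to [0, 1].
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy example: one positive is outranked by one negative.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

To build a table by read length, one would group reads into length bins and call the function once per bin. Using only NumPy keeps the check within the tool's stated dependency set.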