Commit 42fc7f4 (verified) by phanerozoic · 1 parent: b8f10aa

Ship single k=8 discriminative model (host 0.993/0.990, engineered 0.919/0.896, 5-class 0.708); breaks adversary at order 7

  1. TOOL.md +3 -3
TOOL.md CHANGED
@@ -18,7 +18,7 @@ python dna_filter.py reads.fasta --mode classify --report calls.tsv
 
 ## Footprint and throughput
 
-- **Model:** 1.12 MB safetensors (280,903 parameters). No reference database, no GPU.
+- **Model:** 2 MB safetensors (524,295 parameters, single k=8 head set). No reference database, no GPU.
 - **Dependencies:** `numpy` and `safetensors` only.
 - **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
 reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the multi-
@@ -30,7 +30,7 @@ Host vs non-host AUROC by read length:
 
 | length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
 |---|---|---|---|---|---|---|---|
-| AUROC | 0.941 | 0.980 | 0.996 | 0.999 | 1.000 | 1.000 | 1.000 |
+| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |
 
 It is near-perfect from 150 bp up (typical Illumina paired-end and assembled contigs) and still
 useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human reads
@@ -47,5 +47,5 @@ conservative retention for enrichment).
 - **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
 use it for host-vs-microbe, not for separating vertebrate species.
 
-The filter trades a database and an aligner for a 1 MB lookup that runs anywhere, at the cost of
+The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
 the per-base certainty an exact match gives when the organism is already in a reference.
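
Review note: the three changed lines can be sanity-checked with a little arithmetic. The sketch below assumes the weights are stored as 4-byte float32 values, which the diff does not state; the helper name `weights_size_bytes` is hypothetical and not part of `dna_filter.py`. Under that assumption the old and new parameter counts reproduce the quoted file sizes, and the AUROC change is confined to reads of 150 bp and shorter.

```python
# Assumption: each parameter is a 4-byte float32. The diff does not state the dtype.
BYTES_PER_PARAM = 4


def weights_size_bytes(n_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Raw tensor payload in bytes, ignoring the small safetensors header."""
    return n_params * bytes_per_param


# Parameter counts taken from the removed and added "Model" lines.
old_bytes = weights_size_bytes(280_903)
new_bytes = weights_size_bytes(524_295)
print(f"old: {old_bytes / 1e6:.2f} MB")  # old: 1.12 MB
print(f"new: {new_bytes / 1e6:.2f} MB")  # new: 2.10 MB

# AUROC rows copied from the removed and added table lines.
lengths = [50, 100, 150, 200, 300, 600, 1000]
old_auroc = [0.941, 0.980, 0.996, 0.999, 1.000, 1.000, 1.000]
new_auroc = [0.955, 0.986, 0.998, 0.999, 1.000, 1.000, 1.000]

# Per-length change, rounded to the three decimals the table reports.
deltas = [round(new - old, 3) for old, new in zip(old_auroc, new_auroc)]
for length, delta in zip(lengths, deltas):
    print(f"{length:>5} bp: {delta:+.3f}")
```

If the float32 assumption holds, the payload is about 2.10 MB in decimal units (2.00 MiB), so the rounded "2 MB" in TOOL.md is consistent with the new parameter count.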