Drop unreproducible design-distance multiple; correct undatabased non-host rate to 99%
# Reference-free human/host read filter

`dna_filter.py` wraps the classifier's `host` head as a read filter: FASTA/FASTQ in, per-read
call out, with no alignment and no database.

## Modes

- **deplete-host** (pathogen enrichment): emit the non-host reads and discard human reads. This is
  the standard host-depletion step in clinical metagenomics, done without aligning to the human
  reference.
- **scrub-human** (privacy): remove human reads before sharing or deposition.
- **classify**: report per-read origin and scores without filtering.

```bash
python dna_filter.py reads.fastq.gz --mode deplete-host --out nonhost.fasta --report calls.tsv
python dna_filter.py reads.fastq --mode scrub-human --out scrubbed.fasta
python dna_filter.py reads.fasta --mode classify --report calls.tsv
```

## Footprint and throughput

- **Model:** 2 MB safetensors file (524,295 parameters, single k=8 head set). No reference
  database, no GPU.
- **Dependencies:** `numpy` and `safetensors` only.
- **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
  reads. For comparison, a Kraken2 RefSeq index is about 8 GB, and an aligner needs the
  multi-gigabyte human reference.

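
The figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the weights are stored as float32 (4 bytes per parameter), which the text does not state, and it assumes ideal linear scaling across threads, which real I/O will not quite reach. The helper names `model_megabytes` and `est_seconds` are hypothetical and are not part of `dna_filter.py`.

```python
# Back-of-envelope footprint and runtime from the figures quoted above.
PARAMS = 524_295        # parameter count from the Model bullet
READS_PER_SEC = 5_600   # quoted single-thread throughput at 300 bp


def model_megabytes(params: int, bytes_per_param: int = 4) -> float:
    """Weight size in MB, ASSUMING float32 storage (not stated in the README)."""
    return params * bytes_per_param / 1e6


def est_seconds(n_reads: int, threads: int = 1, rate: float = READS_PER_SEC) -> float:
    """Wall-clock estimate, ASSUMING ideal linear scaling across threads."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return n_reads / (rate * threads)


size_mb = model_megabytes(PARAMS)
one_thread = est_seconds(1_000_000)
eight_threads = est_seconds(1_000_000, threads=8)
print(f"{size_mb:.2f} MB; 1M reads: {one_thread:.0f} s on 1 thread, "
      f"{eight_threads:.0f} s on 8 threads")
```

Under these assumptions the parameter count is consistent with the quoted 2 MB file, and a million 300 bp reads take about three minutes on one thread.
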
## Read-length behavior

Host vs non-host AUROC by read length:

| Read length (bp) | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
|---|---|---|---|---|---|---|---|
| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |

Separation is near-perfect from 150 bp up (typical Illumina paired-end reads and assembled
contigs) and still useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99%
of human reads while retaining essentially all non-host reads from 150 bp up. Below 150 bp the
separation narrows, and the threshold should be tuned toward the mode's priority: aggressive
removal for privacy, conservative retention for enrichment.

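
The threshold trade-off described above can be sketched as follows. This is a minimal illustration, not the filter's own code: it assumes a read is called host when its score exceeds the threshold (the sign convention is an assumption), the function names `filter_rates` and `threshold_for_removal` are hypothetical, and the scores are synthetic Gaussians standing in for a short-read regime where the two classes overlap.

```python
import numpy as np


def filter_rates(human_scores, nonhost_scores, threshold=0.0):
    """Return (fraction of human reads removed, fraction of non-host reads kept).

    ASSUMPTION: a read is called host when its score is above the threshold.
    """
    human = np.asarray(human_scores, dtype=float)
    nonhost = np.asarray(nonhost_scores, dtype=float)
    removed = float(np.mean(human > threshold))
    kept = float(np.mean(nonhost <= threshold))
    return removed, kept


def threshold_for_removal(human_scores, target=0.999):
    """Highest threshold that still removes at least `target` of the human reads."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    s = np.sort(np.asarray(human_scores, dtype=float))[::-1]  # descending
    k = int(np.ceil(target * s.size))  # number of human reads that must be removed
    # Step just below the k-th highest score so that read is still called host.
    return float(np.nextafter(s[k - 1], -np.inf))


# Synthetic scores only; these are NOT measurements from the filter.
rng = np.random.default_rng(0)
human = rng.normal(2.0, 1.0, 10_000)
nonhost = rng.normal(-2.0, 1.0, 10_000)

balanced = filter_rates(human, nonhost, threshold=0.0)
privacy_thr = threshold_for_removal(human, target=0.999)
privacy = filter_rates(human, nonhost, threshold=privacy_thr)
print(f"balanced: removed={balanced[0]:.3f} kept={balanced[1]:.3f}")
print(f"privacy (thr={privacy_thr:.2f}): removed={privacy[0]:.3f} kept={privacy[1]:.3f}")
```

Lowering the threshold for privacy removes more human reads at the cost of discarding more non-host reads; for enrichment, the same idea applies in the other direction, raising the threshold until non-host retention meets its target.
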
## Scope

- **Strong:** human against bacterial and viral sequence, the clinically dominant contrast.
- **Reference-free advantage:** on sequence absent from every database, Kraken2 classifies 0% and
  BLAST 6.6%, while this filter returns a call for 100% (99% as non-host). It is the only option
  of the three when the sequence has no database match, for example environmental or divergent
  material.
- **Not for:** discriminating closely related mammals (human vs mouse/rat separation is weak by
  composition); use it for host-vs-microbe, not for separating vertebrate species.

The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
the per-base certainty an exact match gives when the organism is already in a reference.