	# Reference-free human/host read filter

	`dna_filter.py` wraps the classifier's `host` head as a read filter: FASTA/FASTQ in, per-read
	call out, no alignment and no database.

	## Modes

	- deplete-host (pathogen enrichment): emit the non-host reads and discard human reads. This is
	the standard host-depletion step in clinical metagenomics, done without aligning to the human
	reference.
	- scrub-human (privacy): remove human reads before sharing or deposition.
	- classify: per-read origin and scores, no filtering.

	```bash
	python dna_filter.py reads.fastq.gz --mode deplete-host --out nonhost.fasta --report calls.tsv
	python dna_filter.py reads.fastq --mode scrub-human --out scrubbed.fasta
	python dna_filter.py reads.fasta --mode classify --report calls.tsv
	```
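The three modes can be read as different ways of consuming the same per-read host score. The sketch below is illustrative only and is not taken from `dna_filter.py`: the `Call` structure, the `apply_mode` function, the `host_score` field, and the convention that a score at or above the threshold means "host" are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Call(NamedTuple):
    """One scored read (hypothetical structure, not the tool's actual output)."""
    read_id: str
    sequence: str
    host_score: float  # assumed convention: score >= threshold means "host"


def apply_mode(calls: List[Call], mode: str, threshold: float = 0.0) -> List[Tuple]:
    """Return what each mode would emit for a list of scored reads."""
    if mode not in ("deplete-host", "scrub-human", "classify"):
        raise ValueError(f"unknown mode: {mode}")
    out = []
    for call in calls:
        is_host = call.host_score >= threshold
        if mode == "classify":
            # classify: report every read with its call and score, no filtering
            label = "host" if is_host else "non-host"
            out.append((call.read_id, label, call.host_score))
        elif not is_host:
            # deplete-host and scrub-human both keep only reads not called host
            out.append((call.read_id, call.sequence))
    return out
```

In this reading, deplete-host and scrub-human keep the same set of reads at a given threshold; they differ only in which direction the threshold would be moved when the two error types cannot both be minimized.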

	## Footprint and throughput

	- Model: 2 MB safetensors (524,295 parameters, single k=8 head set). No reference database, no GPU.
	- Dependencies: `numpy` and `safetensors` only.
	- Throughput: ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
	reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the
	multi-gigabyte human reference.
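The figures above can be sanity-checked with a little arithmetic. This sketch assumes the weights are stored as 4-byte floats (float32) and that throughput scales linearly with the number of workers. Neither is stated in this README, and the function names are hypothetical.

```python
PARAMS = 524_295          # parameter count quoted above
BYTES_PER_PARAM = 4       # assumption: float32 weights
READS_PER_SECOND = 5_600  # single-thread throughput quoted above (300 bp reads)


def model_size_mb(params: int = PARAMS, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight size in megabytes (10**6 bytes), ignoring file headers."""
    return params * bytes_per_param / 1e6


def estimated_seconds(n_reads: int, workers: int = 1,
                      reads_per_second: float = READS_PER_SECOND) -> float:
    """Wall-clock estimate, assuming ideal linear scaling across workers."""
    return n_reads / (reads_per_second * workers)


if __name__ == "__main__":
    print(f"model size: {model_size_mb():.2f} MB")                     # 2.10 MB
    print(f"1M reads, 1 thread: {estimated_seconds(1_000_000):.0f} s")  # 179 s
    print(f"1M reads, 8 workers: {estimated_seconds(1_000_000, 8):.0f} s")  # 22 s
```

Under these assumptions, one million 300 bp reads take about three minutes on one thread, and the 2 MB figure matches 524,295 float32 weights.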

	## Read-length behavior

	Host vs non-host AUROC by read length:

	| Read length (bp) | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
	|---|---|---|---|---|---|---|---|
	| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |

	Separation is near-perfect from 150 bp up (typical Illumina paired-end reads and assembled
	contigs) and still useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least
	99% of human reads while retaining essentially all non-host reads from 150 bp up. Below 150 bp
	the separation narrows, and the threshold should be tuned toward the mode's priority:
	aggressive removal for privacy, conservative retention for enrichment.
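One way to tune the threshold for short reads is to score a small labelled set and pick the cut that meets a target human-removal rate, then check how much non-host sequence survives. The sketch below is a minimal illustration, not the tool's own procedure. It assumes a read is called host when its score is at or above the threshold, and the function names are hypothetical. It also shows how AUROC can be computed from the same two score arrays.

```python
import math

import numpy as np


def auroc(host_scores, nonhost_scores) -> float:
    """P(host score > non-host score), counting ties as half (pairwise, O(n*m))."""
    h = np.asarray(host_scores, dtype=float)[:, None]
    n = np.asarray(nonhost_scores, dtype=float)[None, :]
    return float((h > n).mean() + 0.5 * (h == n).mean())


def threshold_for_removal(host_scores, target_removal: float) -> float:
    """Highest threshold that still removes at least `target_removal` of host reads.

    Assumes reads with score >= threshold are called host and removed.
    """
    if not 0.0 < target_removal <= 1.0:
        raise ValueError("target_removal must be in (0, 1]")
    s = np.sort(np.asarray(host_scores, dtype=float))
    must_remove = math.ceil(target_removal * len(s))
    # Everything from this index upward is >= the returned threshold.
    return float(s[len(s) - must_remove])


def rates(host_scores, nonhost_scores, threshold: float):
    """Return (fraction of host reads removed, fraction of non-host reads retained)."""
    h = np.asarray(host_scores, dtype=float)
    n = np.asarray(nonhost_scores, dtype=float)
    return float((h >= threshold).mean()), float((n < threshold).mean())
```

For privacy-oriented scrubbing, the target would be set high and the non-host loss accepted. For enrichment, the same functions can be applied the other way round, fixing a retention target on the non-host scores.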

	## Scope

	- Strong: human against bacterial and viral sequence, the clinically dominant contrast.
	- Reference-free advantage: on sequence absent from every database, Kraken2 classifies 0% and
	BLAST 6.6% of reads, while this filter assigns a call to 100% of reads (99% as non-host). A
	reference-free filter is the only option when the sequence has no database match, for example
	environmental or divergent material.
	- Not for: discriminating closely related mammals (human vs mouse/rat is weak by composition);
	use it for host-vs-microbe, not for separating vertebrate species.

	The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
	the per-base certainty an exact match gives when the organism is already in a reference.