# Reference-free human/host read filter

`dna_filter.py` wraps the classifier's `host` head as a read filter: FASTA/FASTQ in, per-read
call out, no alignment and no database.

## Modes

- **deplete-host** (pathogen enrichment): emit the non-host reads and discard the human ones. This is
  the standard host-depletion step in clinical metagenomics, done without aligning to the human reference.
- **scrub-human** (privacy): remove human reads before sharing or deposition.
- **classify**: per-read origin and scores, no filtering.

```bash
python dna_filter.py reads.fastq.gz --mode deplete-host --out nonhost.fasta --report calls.tsv
python dna_filter.py reads.fastq    --mode scrub-human  --out scrubbed.fasta
python dna_filter.py reads.fasta    --mode classify     --report calls.tsv
```
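
To run the same mode over several input files, the invocations above can be wrapped in a small driver. This is a minimal sketch, not part of `dna_filter.py`: `build_command` and `run_batch` are hypothetical helpers, the output naming scheme and the `filtered` directory are assumptions, and each mode is given only the flags it is shown with above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(reads, mode="deplete-host", out_dir="filtered"):
    """Build one dna_filter.py invocation, mirroring the three examples above."""
    reads = Path(reads)
    stem = reads.name.split(".")[0]  # reads.fastq.gz -> reads
    out_dir = Path(out_dir)
    cmd = [sys.executable, "dna_filter.py", str(reads), "--mode", mode]
    if mode in ("deplete-host", "scrub-human"):
        # Filtering modes write the retained reads (hypothetical file naming).
        cmd += ["--out", str(out_dir / f"{stem}.{mode}.fasta")]
    if mode in ("deplete-host", "classify"):
        # Per-read calls, as in the examples that pass --report.
        cmd += ["--report", str(out_dir / f"{stem}.calls.tsv")]
    return cmd


def run_batch(inputs, mode="deplete-host", out_dir="filtered", workers=4):
    """Run one process per input file and return the exit codes in input order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_command(p, mode, out_dir) for p in inputs]
    # Threads only wait on the child processes, so a thread pool is enough here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: subprocess.run(c, check=False), commands))
    return [r.returncode for r in results]


# Example (assumes dna_filter.py is in the working directory):
# exit_codes = run_batch(["a.fastq.gz", "b.fastq.gz"], mode="scrub-human")
```

One process per file keeps each run independent, so a failure on one input shows up as a non-zero exit code without stopping the others.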

## Footprint and throughput

- **Model:** 2 MB safetensors (524,295 parameters, single k=8 head set). No reference database, no GPU.
- **Dependencies:** `numpy` and `safetensors` only.
- **Throughput:** ~5,600 reads/s on a single CPU thread at 300 bp; embarrassingly parallel across
  reads. For comparison, a Kraken2 RefSeq index is about 8 GB and an aligner needs the
  multi-gigabyte human reference.
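
The figures above are easy to sanity-check. The sketch below is back-of-envelope only: it assumes the weights are stored as 4-byte floats (the source does not state the dtype) and that throughput scales linearly with thread count, which is an idealization. The function names are illustrative.

```python
N_PARAMS = 524_295      # parameter count quoted above
BYTES_PER_PARAM = 4     # assumption: float32 weights
READS_PER_SEC = 5_600   # quoted single-thread rate at 300 bp


def model_size_mib(n_params=N_PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight size in MiB, ignoring the safetensors header."""
    return n_params * bytes_per_param / 2**20


def wall_time_minutes(n_reads, threads=1, reads_per_sec=READS_PER_SEC):
    """Estimated wall time in minutes, assuming ideal linear scaling."""
    return n_reads / (reads_per_sec * threads) / 60


print(f"{model_size_mib():.2f} MiB")                         # 2.00 MiB
print(f"{wall_time_minutes(10_000_000, threads=8):.1f} min")  # 3.7 min
```

Under these assumptions, 10 million 300 bp reads take roughly 30 minutes on one thread and under 4 minutes on eight.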

## Read-length behavior

Host vs non-host AUROC by read length:

| length | 50 | 100 | 150 | 200 | 300 | 600 | 1000 |
|---|---|---|---|---|---|---|---|
| AUROC | 0.955 | 0.986 | 0.998 | 0.999 | 1.000 | 1.000 | 1.000 |

Separation is near-perfect from 150 bp up (typical Illumina paired-end reads and assembled contigs)
and still useful at 50 bp. At a balanced threshold of 0, scrub-human removes at least 99% of human
reads while retaining essentially all non-host reads from 150 bp up; below 150 bp the separation
narrows and the threshold should be tuned toward the mode's priority (aggressive removal for
privacy, conservative retention for enrichment).
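
Threshold tuning of this kind can be done on a small labeled set of per-read scores. The sketch below is illustrative and makes assumptions the source does not confirm: that a higher score means "more host-like", that reads scoring above the threshold are called host, and that labeled host and non-host scores are available as arrays. The function names are hypothetical and the data are synthetic.

```python
import numpy as np


def auroc(host_scores, nonhost_scores):
    """AUROC as P(host score > non-host score), counting ties as one half.

    Builds the full pairwise matrix, so it suits thousands of reads, not millions.
    """
    host = np.asarray(host_scores, dtype=float)
    nonhost = np.asarray(nonhost_scores, dtype=float)
    diff = host[:, None] - nonhost[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def rates_at(threshold, host_scores, nonhost_scores):
    """Return (fraction of host removed, fraction of non-host retained)."""
    host = np.asarray(host_scores, dtype=float)
    nonhost = np.asarray(nonhost_scores, dtype=float)
    return float((host > threshold).mean()), float((nonhost <= threshold).mean())


def threshold_for_removal(host_scores, target=0.99):
    """Privacy priority: highest threshold that still removes >= target of host reads."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    host = np.sort(np.asarray(host_scores, dtype=float))
    allowed_misses = host.size - int(np.ceil(target * host.size))
    # Sit just below the first host score that must be removed.
    return float(np.nextafter(host[allowed_misses], -np.inf))


def threshold_for_retention(nonhost_scores, target=0.999):
    """Enrichment priority: lowest threshold that still retains >= target of non-host reads."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    nonhost = np.sort(np.asarray(nonhost_scores, dtype=float))
    must_keep = int(np.ceil(target * nonhost.size))
    return float(nonhost[must_keep - 1])


# Synthetic scores standing in for short reads with overlapping distributions.
rng = np.random.default_rng(0)
host = rng.normal(1.5, 1.0, 2000)
nonhost = rng.normal(-1.5, 1.0, 2000)

print("AUROC:", round(auroc(host, nonhost), 3))
for name, t in [("balanced", 0.0),
                ("privacy", threshold_for_removal(host, 0.99)),
                ("enrichment", threshold_for_retention(nonhost, 0.999))]:
    removed, retained = rates_at(t, host, nonhost)
    print(f"{name}: threshold={t:.2f} host_removed={removed:.3f} nonhost_retained={retained:.3f}")
```

The two helpers pull the threshold in opposite directions: the privacy setting lowers it until the removal target is met, while the enrichment setting raises it until the retention target is met, and the balanced value of 0 typically falls between them when the distributions overlap.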

## Scope

- **Strong:** human against bacterial and viral sequence, the clinically dominant contrast.
- **Reference-free advantage:** on sequence absent from every database, Kraken2 classifies 0% and
  BLAST 6.6%, while this filter makes a call on 100% (99% as non-host). Of the three, it is the only
  one that still produces a call when the sequence has no database match, for example environmental
  or divergent material.
- **Not for:** discriminating closely related mammals (human vs mouse/rat is weak by composition);
  use it for host-vs-microbe, not for separating vertebrate species.

The filter trades a database and an aligner for a 2 MB model that runs anywhere, at the cost of
the per-base certainty an exact match gives when the organism is already in a reference.