DeepTaxa: Hierarchical 16S rRNA Taxonomy Classification

DeepTaxa is a deep learning model for hierarchical taxonomy classification of 16S rRNA gene sequences. The architecture couples a convolutional branch, which captures local k-mer motifs, with a BERT-style transformer, which captures long-range context. Both branches operate over tokens produced by the DNABERT-2 byte-pair encoder. Predictions are generated jointly for all seven standard taxonomic ranks: domain, phylum, class, order, family, genus, and species.

Three checkpoints are released here: one trained on full-length 16S sequences, one trained on V3-V4 amplicons, and one trained on the shorter V4 amplicon.

Checkpoint selection

Sequencing protocol	Recommended checkpoint	File
Sanger 27F/1492R, PacBio HiFi 16S, Oxford Nanopore long-read 16S, full-length reference lookup	Full-length v1	`deeptaxa-full-length-v1.pt`
Illumina paired-end V3-V4 with 341F/805R primers	V3-V4 v1	`deeptaxa-v3v4-v1.pt`
Illumina paired-end V4 with 515F/806R primers	V4 v1	`deeptaxa-v4-v1.pt`
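
The table above can be encoded as a small lookup when checkpoint selection needs to be scripted. This is a minimal sketch: the function name `checkpoint_for` and the region keys are hypothetical and not part of the DeepTaxa package; only the file names come from the table.

```python
# Hypothetical helper (not part of DeepTaxa): maps a target region to the
# released checkpoint file listed in the selection table.
CHECKPOINTS = {
    "full-length": "deeptaxa-full-length-v1.pt",
    "v3-v4": "deeptaxa-v3v4-v1.pt",
    "v4": "deeptaxa-v4-v1.pt",
}


def checkpoint_for(region: str) -> str:
    """Return the checkpoint file name for a region label such as 'V3-V4'."""
    key = region.strip().lower().replace("_", "-")
    if key not in CHECKPOINTS:
        raise ValueError(
            f"unknown region {region!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


print(checkpoint_for("V4"))
```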

Released checkpoints

Checkpoint	Training data	Species Acc	Species F1	Species ECE	Params
Full-length v1	277,336 full-length 16S sequences (approximately 1,500 bp) from Greengenes2	92.88%	92.03%	0.0251	76.4 M
V3-V4 v1	273,003 in-silico V3-V4 extractions (approximately 420 bp) from Greengenes2	87.55%	85.92%	0.0278	75.8 M
V4 v1	274,509 in-silico V4 extractions (approximately 253 bp) from Greengenes2	82.84%	80.16%	0.0256	76.4 M

Species-level metrics above are single-seed (seed 42) test-set values for the published checkpoints. Across three seeds (42, 123, 456) the full-length checkpoint achieves species accuracy of 92.96% +/- 0.07 pp and species F1 of 92.12% +/- 0.08 pp, indicating high reproducibility. The V3-V4 and V4 checkpoints are released as single-seed (seed 42) models.
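
For readers unfamiliar with the ECE column, the sketch below shows one common way expected calibration error is computed: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence is averaged with bin-size weights. The model card does not state the binning used for the reported values, so the 15-bin default here is an assumption, and the function is illustrative rather than the evaluation code behind the table.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: top-1 probabilities in [0, 1]
    correct:     booleans, True where the top-1 prediction was right
    n_bins:      number of bins (15 is an assumption, not from the model card)
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Bins are left-closed; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(hit)
        counts[idx] += 1

    total = len(confidences)
    ece = 0.0
    for c, h, k in zip(conf_sum, hits, counts):
        if k:
            ece += (k / total) * abs(h / k - c / k)
    return ece
```

A perfectly calibrated bin (for example, ten predictions at confidence 0.9 with nine correct) contributes zero; lower values indicate that confidence scores track accuracy more closely.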

All three checkpoints are inference-only. Optimizer and scheduler state have been removed to reduce file size; resuming training from these checkpoints is not supported.

Architecture

The full-length, V3-V4, and V4 checkpoints share the canonical compact HybridCNNBERT configuration. The full-length and V3-V4 checkpoints were updated in April 2026 to this smaller, faster architecture: the full-length checkpoint matched or beat its prior numbers at every taxonomic rank with roughly 32% fewer parameters; the V3-V4 checkpoint achieved equivalent species-level performance (Acc 87.55% vs 87.52%, F1 85.92% vs 85.79%) at roughly 24% fewer parameters. The V4 checkpoint (added June 2026) was trained from scratch in this same configuration on in-silico V4 amplicons.

Component	Full-length v1	V3-V4 v1	V4 v1
`tokenizer_name`	`zhihan1996/DNABERT-2-117M`	`zhihan1996/DNABERT-2-117M`	`zhihan1996/DNABERT-2-117M`
`max_length`	512 (tokens)	512 (tokens)	512 (tokens)
`embed_dim`	896	896	896
`num_filters`	256	256	256
`kernel_sizes`	`[3, 5, 7]`	`[3, 5, 7]`	`[3, 5, 7]`
`num_conv_layers`	1	1	1
`hidden_size`	896	896	896
`num_hidden_layers`	4	4	4
`num_attention_heads`	7	7	7
`intermediate_size`	3584	3584	3584
`hidden_dropout_prob`	0.20	0.20	0.20
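
The shared configuration can be written down as a plain data structure, which also makes two derived quantities easy to check. This is a sketch only: the class name `HybridConfig` is hypothetical, the field names and values are copied from the table, and the per-head dimension assumes standard multi-head attention that splits `hidden_size` evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:  # hypothetical name; values taken from the table above
    tokenizer_name: str = "zhihan1996/DNABERT-2-117M"
    max_length: int = 512
    embed_dim: int = 896
    num_filters: int = 256
    kernel_sizes: tuple = (3, 5, 7)
    num_conv_layers: int = 1
    hidden_size: int = 896
    num_hidden_layers: int = 4
    num_attention_heads: int = 7
    intermediate_size: int = 3584
    hidden_dropout_prob: float = 0.20

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming hidden_size is split evenly across heads."""
        if self.hidden_size % self.num_attention_heads:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads


cfg = HybridConfig()
print(cfg.head_dim)                               # 896 / 7 = 128
print(cfg.intermediate_size // cfg.hidden_size)   # feed-forward expansion: 4
```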

Test-set performance

All three checkpoints were evaluated on their respective held-out Greengenes2 2024.09 test splits. Numbers below are seed 42 test-set values for the published checkpoints.

Rank	Full-length Acc	Full-length F1	V3-V4 Acc	V3-V4 F1	V4 Acc	V4 F1
Domain	99.99%	99.99%	99.99%	99.99%	99.98%	99.98%
Phylum	99.68%	99.67%	99.68%	99.66%	99.59%	99.57%
Class	99.63%	99.59%	99.64%	99.60%	99.54%	99.49%
Order	99.09%	98.99%	98.99%	98.88%	98.77%	98.67%
Family	98.61%	98.41%	98.41%	98.19%	98.10%	97.88%
Genus	96.93%	96.51%	95.27%	94.73%	93.46%	92.65%
Species	92.88%	92.03%	87.55%	85.92%	82.84%	80.16%

Training configuration

Parameter	Full-length v1	V3-V4 v1	V4 v1
Training data	Greengenes2 2024.09 training set (277,336 full-length sequences, approximately 1,500 bp)	In-silico V3-V4 extractions from the same training set (273,003 amplicons)	In-silico V4 extractions from the same training set (274,509 amplicons)
Test data	Greengenes2 2024.09 test split (69,335 full-length sequences)	V3-V4 extractions from the test split (68,282 amplicons)	V4 extractions from the test split (68,668 amplicons)
Extraction primers	N/A	341F `CCTACGGGNGGCWGCAG` and 805R `GACTACHVGGGTATCTAATCC`	515F `GTGYCAGCMGCCGCGGTAA` and 806R `GGACTACNVGGGTWTCTAAT`
Label space (species)	16,909	8,347	16,909
Label space (domain / phylum / class / order / family / genus)	2 / 129 / 349 / 997 / 2,250 / 7,287	2 / 115 / 270 / 709 / 1,528 / 4,529	2 / 129 / 349 / 997 / 2,250 / 7,287
Total parameters	76,365,205	75,813,550	76,365,205
Learning rate	5e-4	5e-4	5e-4
Batch size	64	64	64
Weight decay	1e-2	1e-2	1e-2
Epochs	10	10	10
Loss	Cross-entropy with uniform per-rank weights	Cross-entropy with uniform per-rank weights	Cross-entropy with uniform per-rank weights
Optimizer	AdamW (beta1 = 0.9, beta2 = 0.999)	AdamW (beta1 = 0.9, beta2 = 0.999)	AdamW (beta1 = 0.9, beta2 = 0.999)
Learning rate schedule	Linear warm-up over 10% of steps, followed by linear decay	Linear warm-up over 10% of steps, followed by linear decay	Linear warm-up over 10% of steps, followed by linear decay
Seed	42	42	42
Hardware	NVIDIA GeForce RTX 4090	NVIDIA GeForce RTX 4090	NVIDIA A40
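
The learning rate schedule in the table (linear warm-up over 10% of steps, then linear decay) can be sketched as a function of the optimizer step. The step count below assumes one optimizer step per batch of 64, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these details are stated in the model card, and the function names are illustrative.

```python
import math


def total_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming partial batches are kept and no accumulation."""
    return math.ceil(n_examples / batch_size) * epochs


def lr_at(step: int, total: int, peak_lr: float = 5e-4, warmup_frac: float = 0.10) -> float:
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    warmup = max(1, round(total * warmup_frac))
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))


steps = total_steps(277_336, 64, 10)  # full-length training set
for s in (0, steps // 20, steps // 10, steps // 2, steps):
    print(s, lr_at(s, steps))
```

Under these assumptions the full-length run has 43,340 steps, with the peak rate of 5e-4 reached at step 4,334.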

Usage

Download

# Full-length checkpoint
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v1.pt

# V3-V4 checkpoint
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v3v4-v1.pt

# V4 checkpoint
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v4-v1.pt

# Or clone the full repository
git clone https://huggingface.co/systems-genomics-lab/deeptaxa

Python API with huggingface_hub:

from huggingface_hub import hf_hub_download

# Full-length
full_length_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-full-length-v1.pt",
)

# V3-V4
v3v4_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-v3v4-v1.pt",
)

# V4
v4_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-v4-v1.pt",
)
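
Because checkpoint files have been replaced in place (see Version history), it can be useful to record the SHA-256 digest of a downloaded file. The helper below is a generic standard-library sketch; `sha256_of` is an illustrative name, and no reference digests are published in this model card.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Stream the file so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of(full_length_ckpt)` returns the digest of the file fetched above, which can be compared against a previously recorded value.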

Install DeepTaxa and run predictions

pip install git+https://github.com/systems-genomics-lab/deeptaxa.git

# Full-length sequences (Sanger, PacBio HiFi, Oxford Nanopore)
deeptaxa predict \
  --fasta-file your_full_length_16s.fna.gz \
  --checkpoint deeptaxa-full-length-v1.pt \
  --output-dir predictions/

# V3-V4 amplicons (Illumina, already demultiplexed and primer-trimmed)
deeptaxa predict \
  --fasta-file your_v3v4_amplicons.fna.gz \
  --checkpoint deeptaxa-v3v4-v1.pt \
  --output-dir predictions/

# V4 amplicons (Illumina, already demultiplexed and primer-trimmed)
deeptaxa predict \
  --fasta-file your_v4_amplicons.fna.gz \
  --checkpoint deeptaxa-v4-v1.pt \
  --output-dir predictions/

Input preparation for amplicon checkpoints: the input FASTA file should contain region-matched sequences that have already been demultiplexed and primer-trimmed by an upstream tool such as DADA2, cutadapt, or QIIME2. The V3-V4 and V4 checkpoints were trained on in-silico primer extractions (341F/805R and 515F/806R respectively), which approximate merged paired-end amplicons. Paired-end reads should therefore be merged into consensus amplicons prior to prediction, or the forward read alone may be provided.
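
To illustrate what an in-silico primer extraction involves, the sketch below locates a degenerate forward primer and the reverse complement of the reverse primer, then returns the sequence between them. It is not the extraction pipeline used to build the training sets: it requires exact matches to the IUPAC patterns, whereas tools such as cutadapt tolerate mismatches, and the function names are illustrative.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence, including degenerate IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> str:
    """Turn a degenerate primer into a regex that matches its concrete forms."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq: str, forward: str, reverse: str):
    """Return the region between the primers (primers excluded), or None."""
    pattern = re.compile(
        primer_regex(forward) + "(.+?)" + primer_regex(reverse_complement(reverse))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Primer sequences from the training configuration table.
V3V4 = ("CCTACGGGNGGCWGCAG", "GACTACHVGGGTATCTAATCC")   # 341F / 805R
V4 = ("GTGYCAGCMGCCGCGGTAA", "GGACTACNVGGGTWTCTAAT")    # 515F / 806R
```

Calling `extract_amplicon(sequence, *V3V4)` on a full-length 16S sequence returns the primer-trimmed V3-V4 region when both primer sites match exactly, and `None` otherwise.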

Full usage documentation and analysis notebooks are available in the GitHub repository.

Limitations

Limitations that apply to all three checkpoints:

Approximately 44.8% of Greengenes2 species have only a single training example, which limits reliable prediction for those classes.
The label space corresponds to Greengenes2 2024.09. Predictions are produced against the exact Greengenes2 hierarchy, and species absent from the training data cannot be predicted. Adapting the model to a different reference database, such as SILVA or GTDB, would require retraining.
A GPU is strongly recommended; CPU inference is impractical for large datasets.

Limitations specific to the full-length checkpoint:

Best performance is obtained on sequences of at least 1,200 bp; shorter amplicons should be classified using the region-matched V3-V4 or V4 checkpoint.
Species-level accuracy plateaus near 93%.

Limitations specific to the V3-V4 checkpoint:

Species-level accuracy plateaus near 87.5%. The approximately 420 bp V3-V4 region carries less taxonomic information than the full 16S gene.
The label space contains 8,347 species. Those for which no V3-V4 amplicon could be extracted during training are absent and cannot be predicted.
Primer specificity: the model was trained on 341F/805R extractions. Sequences amplified with other V3-V4 primers, such as 357F or 338F, or with substantially different region boundaries may yield degraded predictions.

Limitations specific to the V4 checkpoint:

Species-level accuracy is approximately 82.8%. The approximately 253 bp V4 region carries less taxonomic information than V3-V4 or the full 16S gene, so species-level calls should be read together with their confidence scores.
The label space contains 16,909 species (the same as the full-length checkpoint), retained because V4 amplicons were extracted at 99.0% yield. Species for which no V4 amplicon could be extracted during training are absent and cannot be predicted.
Primer specificity: the model was trained on 515F/806R extractions. Sequences amplified with other V4 primers or with substantially different region boundaries may yield degraded predictions.
The V4 checkpoint is released as a single-seed (seed 42) model; no cross-seed standard deviation is reported for it.
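
One way to read species-level calls together with their confidence scores is to report a lineage only down to the deepest rank that clears a threshold. The sketch below assumes a hypothetical per-sequence structure mapping each rank to a `(label, confidence)` pair; DeepTaxa's actual output format is documented in the GitHub repository and may differ. The 0.8 threshold and the example values are arbitrary illustrations, not model output.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def confident_lineage(prediction, threshold=0.8):
    """Keep ranks from domain downward until confidence falls below threshold.

    prediction: dict mapping rank -> (label, confidence); hypothetical layout.
    """
    kept = []
    for rank in RANKS:
        label, confidence = prediction[rank]
        if confidence < threshold:
            break  # stop at the first rank that is not confident enough
        kept.append((rank, label))
    return kept


# Illustrative values only.
example = {
    "domain": ("d__Bacteria", 0.99),
    "phylum": ("p__Proteobacteria", 0.99),
    "class": ("c__Gammaproteobacteria", 0.98),
    "order": ("o__Enterobacterales", 0.97),
    "family": ("f__Enterobacteriaceae", 0.95),
    "genus": ("g__Escherichia", 0.90),
    "species": ("s__Escherichia coli", 0.55),
}
print(confident_lineage(example))  # lineage reported down to genus
```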

Citation

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa}
}

References

Akiba, T., Sano, S., Yanase, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623-2631. DOI: 10.1145/3292500.3330701
Bolyen, E., Rideout, J.R., Dillon, M.R., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852-857. DOI: 10.1038/s41587-019-0209-9
Callahan, B.J., McMurdie, P.J., Rosen, M.J., et al. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-583. DOI: 10.1038/nmeth.3869
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. DOI: 10.14806/ej.17.1.200
McDonald, D., Jiang, Y., Balaban, M., et al. (2024). Greengenes2 unifies microbial data in a single reference tree. Nature Biotechnology, 42(5), 715-718. DOI: 10.1038/s41587-023-01845-1
Parks, D.H., Chuvochina, M., Chaumeil, P.A., et al. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, 38(9), 1079-1086. DOI: 10.1038/s41587-020-0501-8
Quast, C., Pruesse, E., Yilmaz, P., et al. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(D1), D590-D596. DOI: 10.1093/nar/gks1219
Zhou, Z., Ji, Y., Li, W., et al. (2024). DNABERT-2: Efficient foundation model and benchmark for multi-species genomes. International Conference on Learning Representations. arXiv:2306.15006

Contact

For support, please open an issue on the GitHub repository.

Acknowledgments

Hugging Face, for hosting datasets and models.
The High-Performance Computing Team of the School of Sciences and Engineering at the American University in Cairo, for granting access to the GPU resources used in training.

Version history

v1 (June 2026). Added the V4 checkpoint (deeptaxa-v4-v1.pt), trained from scratch in the canonical compact HybridCNNBERT configuration on 274,509 in-silico V4 extractions (515F/806R, approximately 253 bp) from Greengenes2 2024.09. Single-seed (seed 42); species accuracy 82.84%, F1 80.16%, ECE 0.0256. The V4 amplicon was extracted at 99.0% yield, so the checkpoint keeps the full 16,909-species label space and matches the full-length parameter count (76.4 M).

v1 (April 2026). Initial release of the full-length and V3-V4 checkpoints. Both were updated in late April 2026 to the canonical compact (SMALL) HybridCNNBERT architecture (76.4 M and 75.8 M parameters respectively; kernels 3/5/7, 256 filters, 4 transformer layers, 7 attention heads, 3584 FFN intermediate, 896 hidden, dropout 0.20). The full-length update (v1.1) matched or beat the prior full-length numbers at every taxonomic rank with roughly 32% fewer parameters and roughly half the training time. The V3-V4 update (v1.2) achieved equivalent species-level performance (Acc 87.55% vs 87.52%, F1 85.92% vs 85.79%) at roughly 24% fewer parameters, harmonizing the two checkpoints under the same architecture. Users who downloaded either checkpoint before the corresponding update may see different SHA-256 hashes; re-downloading retrieves the updated file.
