DeepTaxa: Hybrid CNN-BERT Model
DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts pre-trained hybrid models that combine convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
Available Models
| File | Version | Training Date | Loss Function | Dataset Size | Species (weighted F1) |
|---|---|---|---|---|---|
| `deeptaxa_v2.0.pt` | v2.0 (recommended) | 2026-03-22 | Cross-Entropy | 277,336 seqs | 0.9429 |
| `deeptaxa_v1.0.pt` | v1.0 | 2025-04-04 | Focal | 161,866 seqs | 0.9457 |
v2.0 is the recommended model. It was trained on the full Greengenes 2024-09 dataset (277,336 sequences, 16,909 species) with cross-entropy loss, which was shown to outperform focal loss in controlled comparisons.
Model Details
- Architecture: HybridCNNBERTClassifier (CNN + BERT)
- Tokenizer: `zhihan1996/DNABERT-2-117M`
- Training Data: Greengenes dataset (2024-09)
- Levels Predicted: 7 (Domain: 2, Phylum: 129, Class: 349, Order: 997, Family: 2,250, Genus: 7,287, Species: 16,909)
- Embedding Dimension: 896
- Transformer Layers: 4 (7 attention heads)
- CNN Filters: 256 (kernel sizes: 3, 5, 7)
- Max Sequence Length: 512
- Dropout Probability: 0.2
- License: MIT
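The CNN component listed above (256 filters per kernel size 3/5/7 over 896-dimensional embeddings, max length 512) can be illustrated as a standard multi-kernel convolution over token embeddings. This is a hedged sketch, not the actual `HybridCNNBERTClassifier` source; the class name and wiring here are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the CNN branch: 256 filters per kernel size
# (3, 5, 7) applied over token embeddings, max-pooled over the sequence.
class MultiKernelCNN(nn.Module):
    def __init__(self, embed_dim=896, num_filters=256, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)  # (batch, num_filters * len(kernel_sizes))

x = torch.randn(2, 512, 896)       # two sequences at max length 512
out = MultiKernelCNN()(x)
print(out.shape)                   # torch.Size([2, 768])
```

In this design, max-pooled features from each kernel size are concatenated (3 × 256 = 768 features per sequence) before being combined with the BERT representation.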
Usage
Download the Model
Download the recommended v2.0 model file deeptaxa_v2.0.pt from this repository:
Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download `deeptaxa_v2.0.pt` (~912 MB).

Command Line (wget):

```shell
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_v2.0.pt
```

Command Line (git clone):

```shell
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
cd deeptaxa
# The model files are now in the current directory
```
Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
```shell
python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_v2.0.pt
```
Full instructions are available in the GitHub repository.
Training Details (v2.0)
- Dataset: 277,336 total sequences (80/20 train/validation split) from Greengenes (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`). Test set: 69,335 held-out sequences.
- Hyperparameters:
- Learning Rate: 0.0005
- Batch Size: 64
- Epochs: 10
- Optimizer: AdamW (lr=0.0005, betas=[0.9, 0.999], weight_decay=0.01)
- Scheduler Warmup Ratio: 0.1
- Loss: Cross-Entropy (uniform level weights)
- Seed: 42
- Training Time: ~3 hours 14 minutes on NVIDIA A40 GPU
- Timestamp: Trained on 2026-03-22
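The optimizer settings above can be reproduced with a short PyTorch sketch. Only the AdamW hyperparameters and the 0.1 warmup ratio come from the list; the stand-in model, step count, and the linear warmup-then-decay shape are assumptions for illustration:

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(896, 16909)  # stand-in classifier head, not the real model
optimizer = AdamW(model.parameters(), lr=5e-4,
                  betas=(0.9, 0.999), weight_decay=0.01)

# Warmup ratio 0.1: ramp the LR linearly over the first 10% of steps,
# then decay it linearly to zero (the decay shape is an assumption).
total_steps = 1000
warmup_steps = int(0.1 * total_steps)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: (step / warmup_steps if step < warmup_steps
                  else max(0.0, (total_steps - step) / (total_steps - warmup_steps))),
)
```

After `warmup_steps` scheduler steps, the learning rate reaches its peak of 5e-4 and begins decaying.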
Performance (v2.0)
Test set metrics (69,335 held-out sequences):
| Level | Accuracy | F1-Score (weighted) |
|---|---|---|
| Domain | 99.98% | 99.98% |
| Phylum | 99.67% | 99.66% |
| Class | 99.62% | 99.61% |
| Order | 99.07% | 99.04% |
| Family | 98.59% | 98.60% |
| Genus | 96.86% | 97.10% |
| Species | 92.92% | 94.29% |
- Training Loss: 0.059
- Validation Loss: 1.345
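The reported F1 values are support-weighted averages over classes. A quick sketch of how such per-level metrics can be computed, using toy labels and assuming scikit-learn is available (this is not the project's evaluation code):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels standing in for one taxonomic level; the real evaluation
# runs over 69,335 held-out test sequences per level.
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)
f1w = f1_score(y_true, y_pred, average="weighted")  # support-weighted F1
print(f"accuracy={acc:.3f} weighted_f1={f1w:.3f}")  # accuracy=0.800 weighted_f1=0.787
```

Support weighting explains why the species F1 (94.29%) can exceed species accuracy (92.92%): well-populated species contribute more to the average.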
Changes from v1.0 to v2.0
- Loss function: Focal loss (gamma=2.0, hierarchical level weights) → Cross-entropy (uniform weights). CE was shown to outperform focal loss by ~0.5pp at species level in a controlled two-seed comparison.
- Dataset: Expanded from 161,866 to 277,336 training sequences (16,909 species vs 10,547).
- Hyperparameters: Batch size increased from 16 to 64; learning rate from 0.0001 to 0.0005.
- Evaluation: v2.0 metrics are reported on a held-out test set (69,335 sequences), whereas v1.0 metrics were on a validation split.
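The focal loss dropped in v2.0 down-weights easy examples by scaling per-sample cross-entropy with a (1 − p_t)^γ factor. A minimal sketch of the two losses, assuming the standard focal-loss formulation (the project's exact implementation may differ):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Standard focal loss: scale per-sample CE by (1 - p_true)^gamma,
    # which down-weights well-classified examples.
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_true = torch.exp(-ce)  # model's probability for the true class
    return ((1 - p_true) ** gamma * ce).mean()

torch.manual_seed(42)
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(float(F.cross_entropy(logits, targets)), float(focal_loss(logits, targets)))
```

Because (1 − p_t)^γ ≤ 1, focal loss is always at most the cross-entropy on the same batch; v2.0's finding is that this re-weighting did not help at the species level here.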
Intended Use
- Taxonomy classification in microbiome research and microbial ecology.
Limitations
- GPU recommended (trained on NVIDIA A40).
- Lower precision at species level due to label complexity (16,909 classes) and severe class imbalance (Gini coefficient = 0.82).
- Singleton species (44.8% of all species) contribute only one training sequence each, so predictions for them are less reliable.
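The Gini coefficient cited above quantifies how unevenly sequences are distributed across species (0 = perfectly balanced, values near 1 = extreme imbalance). A self-contained sketch of the standard computation on toy class counts (not the project's code):

```python
import numpy as np

def gini(counts):
    # Gini coefficient via the sorted-rank formula:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with x sorted ascending.
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    return float((2 * np.arange(1, n + 1) - n - 1) @ x / (n * x.sum()))

print(gini([10, 10, 10]))   # 0.0  — perfectly balanced classes
print(gini([1, 1, 1, 97]))  # 0.72 — one dominant class
```

A value of 0.82 over 16,909 species indicates that a small fraction of species holds most of the training sequences.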
Citation
If you use this model in your research, please cite:
```bibtex
@software{DeepTaxa,
  author    = {{Systems Genomics Lab}},
  title     = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/systems-genomics-lab/deeptaxa},
}
```
Contact
Open an issue on GitHub for support.
Acknowledgements
- Dr. Olaitan I. Awe and the Omics Codeathon team for their mentorship and contributions.
- Hugging Face for providing a platform to host datasets and models.
- The High-Performance Computing Team of the School of Sciences and Engineering (SSE) at the American University in Cairo (AUC) for their support and for granting access to GPU resources that enabled this work.
Evaluation results
Self-reported accuracy on the Greengenes 2024-09 splits:

| Level | Test Accuracy | Validation Accuracy |
|---|---|---|
| Domain | 1.000 | 1.000 |
| Phylum | 0.997 | 0.999 |
| Class | 0.996 | 0.999 |
| Order | 0.991 | 0.997 |
| Family | 0.986 | 0.995 |
| Genus | 0.969 | 0.983 |
| Species | 0.929 | 0.953 |

Species weighted F1 (test set): 0.943.