DeepTaxa: Hybrid CNN-BERT Model

DeepTaxa is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts pre-trained hybrid CNN-BERT models, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

Available Models

File              Version             Training Date  Loss Function  Dataset Size  Species F1 (weighted)
deeptaxa_v2.0.pt  v2.0 (recommended)  2026-03-22     Cross-entropy  277,336 seqs  0.9429
deeptaxa_v1.0.pt  v1.0                2025-04-04     Focal          161,866 seqs  0.9457

Note: the two F1 scores are not directly comparable: v2.0 was evaluated on a held-out test set, while v1.0 was evaluated on a validation split (see "Changes from v1.0 to v2.0" below).

v2.0 is the recommended model. It was trained on the full Greengenes 2024-09 dataset (277,336 sequences, 16,909 species) with cross-entropy loss, which was shown to outperform focal loss in controlled comparisons.

Model Details

  • Architecture: HybridCNNBERTClassifier (CNN + BERT)
  • Tokenizer: zhihan1996/DNABERT-2-117M
  • Training Data: Greengenes dataset (2024-09)
  • Levels Predicted: 7 (Domain: 2, Phylum: 129, Class: 349, Order: 997, Family: 2,250, Genus: 7,287, Species: 16,909)
  • Embedding Dimension: 896
  • Transformer Layers: 4 (7 attention heads)
  • CNN Filters: 256 (kernel sizes: 3, 5, 7)
  • Max Sequence Length: 512
  • Dropout Probability: 0.2
  • License: MIT
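The listed dimensions can be sanity-checked with simple arithmetic. A minimal sketch, assuming a standard multi-head attention layout (embedding split evenly across heads) and a CNN branch that concatenates one 256-filter output per kernel size; neither assumption is confirmed by this card:

```python
# Sanity-check the hybrid model's listed dimensions.
# Assumptions (not confirmed by this card): attention splits the embedding
# evenly across heads; the CNN branch concatenates one output per kernel size.

EMBED_DIM = 896
NUM_HEADS = 7
CNN_FILTERS = 256
KERNEL_SIZES = (3, 5, 7)

assert EMBED_DIM % NUM_HEADS == 0           # 896 divides evenly across 7 heads
head_dim = EMBED_DIM // NUM_HEADS           # per-head attention dimension

# Concatenated CNN feature width across the three kernel sizes
cnn_feature_dim = CNN_FILTERS * len(KERNEL_SIZES)

print(head_dim)         # 128
print(cnn_feature_dim)  # 768
```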

Usage

Download the Model

Download the recommended v2.0 model file deeptaxa_v2.0.pt from this repository:

  • Manual Download: Visit https://huggingface.co/systems-genomics-lab/deeptaxa, click on the "Files and versions" tab, and download deeptaxa_v2.0.pt (~912 MB).

  • Command Line (wget):

    wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_v2.0.pt
    
  • Command Line (git clone):

    git clone https://huggingface.co/systems-genomics-lab/deeptaxa
    cd deeptaxa
    # The model files are now in the current directory
    

Run Predictions

Once the checkpoint is downloaded, run predictions with the DeepTaxa CLI:

python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_v2.0.pt

Full instructions are available on the GitHub repository.
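The CLI accepts gzipped FASTA input (.fna.gz). If you want to inspect or pre-filter sequences before prediction, a minimal standard-library sketch of a FASTA reader (illustrative only, not part of DeepTaxa):

```python
import gzip

def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        header, chunks = None, []
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)
```

Note that the model's 512 maximum applies to tokens, and DNABERT-2 uses BPE tokenization, so raw base count is only a rough proxy for token count.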

Training Details (v2.0)

  • Dataset: 277,336 total sequences (80/20 train/validation split) from Greengenes (gg_2024_09_training.fna.gz, gg_2024_09_training.tsv.gz). Test set: 69,335 held-out sequences.
  • Hyperparameters:
    • Learning Rate: 0.0005
    • Batch Size: 64
    • Epochs: 10
    • Optimizer: AdamW (lr=0.0005, betas=[0.9, 0.999], weight_decay=0.01)
    • Scheduler Warmup Ratio: 0.1
    • Loss: Cross-Entropy (uniform level weights)
    • Seed: 42
  • Training Time: ~3 hours 14 minutes on NVIDIA A40 GPU
  • Timestamp: Trained on 2026-03-22
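A reproducible 80/20 split with a fixed seed, as described above, can be sketched as follows (an illustration with the standard library; the actual DeepTaxa splitting code may differ):

```python
import random

def train_val_split(items, val_fraction=0.2, seed=42):
    """Deterministically shuffle and split items into (train, validation)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # same seed -> same split every run
    n_val = int(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

ids = [f"seq_{i}" for i in range(10)]
train, val = train_val_split(ids)
assert len(train) == 8 and len(val) == 2
```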

Performance (v2.0)

Test set metrics (69,335 held-out sequences):

Level    Accuracy  F1-Score
Domain   99.98%    99.98%
Phylum   99.67%    99.66%
Class    99.62%    99.61%
Order    99.07%    99.04%
Family   98.59%    98.60%
Genus    96.86%    97.10%
Species  92.92%    94.29%
  • Training Loss: 0.059
  • Validation Loss: 1.345
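The per-level F1 scores above are class-weighted. A minimal sketch of weighted F1 (support-weighted average of per-class F1, following scikit-learn's average='weighted' convention; this card does not state which library computed the reported metrics):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1   # weight each class by its support
    return score
```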

Changes from v1.0 to v2.0

  • Loss function: Focal loss (gamma=2.0, hierarchical level weights) → Cross-entropy (uniform weights). CE was shown to outperform focal loss by ~0.5pp at species level in a controlled two-seed comparison.
  • Dataset: Expanded from 161,866 to 277,336 training sequences (16,909 species vs 10,547).
  • Hyperparameters: Batch size increased from 16 to 64; learning rate from 0.0001 to 0.0005.
  • Evaluation: v2.0 metrics are reported on a held-out test set (69,335 sequences), whereas v1.0 metrics were on a validation split.
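The two loss functions differ only by a modulating factor: for a true-class probability p, cross-entropy is -log(p), while focal loss is -(1-p)^gamma * log(p), which down-weights well-classified examples. A minimal sketch (gamma=2.0, matching v1.0):

```python
import math

def cross_entropy(p):
    """Cross-entropy for the true class with predicted probability p."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """Focal loss: cross-entropy scaled by the modulating factor (1-p)^gamma."""
    return -((1 - p) ** gamma) * math.log(p)

# For an easy example (p = 0.9) the modulating factor is 0.01, so focal loss
# is 100x smaller than cross-entropy; for a hard example (p = 0.1) the factor
# is 0.81, so the two losses are close.
```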

Intended Use

  • Taxonomy classification in microbiome research and microbial ecology.

Limitations

  • GPU recommended (trained on NVIDIA A40).
  • Lower precision at species level due to label complexity (16,909 classes) and severe class imbalance (Gini coefficient = 0.82).
  • Singleton species (44.8% of all species) have limited predictive power.
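The Gini coefficient quoted above measures how unevenly sequences are distributed across species labels (0 = perfectly balanced, approaching 1 = extremely skewed). A sketch of one common formulation, using the sorted-counts identity; the card does not state which formulation was used:

```python
def gini(counts):
    """Gini coefficient of a list of non-negative class counts."""
    counts = sorted(counts)
    n = len(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Sorted-values identity: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n,
    # with 1-based rank i over counts sorted ascending.
    weighted = sum(i * x for i, x in enumerate(counts, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Balanced counts give 0; concentrating all mass in one class gives the
# maximum (n - 1) / n for n classes.
```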

Citation

If you use this model in your research, please cite:

@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}

Contact

Open an issue on GitHub for support.
