---
language: en
tags:
- bioinformatics
- microbiology
- microbiome
- taxonomy-classification
- deep-learning
- 16s-rrna
datasets:
- systems-genomics-lab/greengenes
metrics:
- accuracy
- precision
- recall
- f1
license: mit
model-index:
- name: DeepTaxa Hybrid CNN-BERT (April 2025)
  results:
  - task:
      type: classification
      name: Hierarchical Taxonomy Classification
    dataset:
      type: systems-genomics-lab/greengenes
      name: Greengenes (2024-09 Validation Split)
      split: validation
    metrics:
    - type: accuracy
      value: 0.9999258655200534
      name: Domain Accuracy
    - type: accuracy
      value: 0.9992339437072182
      name: Phylum Accuracy
    - type: accuracy
      value: 0.9988879828008006
      name: Class Accuracy
    - type: accuracy
      value: 0.9971581782687128
      name: Order Accuracy
    - type: accuracy
      value: 0.9950824128302074
      name: Family Accuracy
    - type: accuracy
      value: 0.9833444535053253
      name: Genus Accuracy
    - type: accuracy
      value: 0.9528751822472632
      name: Species Accuracy
---
# DeepTaxa: Hybrid CNN-BERT Model (April 2025)
**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, which combines convolutional neural networks (CNNs) and BERT to predict seven taxonomic levels: domain, phylum, class, order, family, genus, and species. Validation accuracy ranges from 99.99% at the domain level to 95.29% at the species level.
## Model Details
- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
- **Training Data**: Greengenes dataset (2024-09 split)
- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- **Total Parameters**: 72,635,154
- **Max Sequence Length**: 512
- **Dropout Probability**: 0.2
- **License**: MIT
- **Version**: April 2025
- **File**: `deeptaxa_april_2025.pt`
## Usage
### Download the Model
To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:
- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- **Command Line (wget)**:
```bash
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
```
- **Command Line (git clone)**:
```bash
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
cd deeptaxa
# The model file is now in the current directory
```
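For scripted workflows, the same file can also be fetched from Python. The sketch below uses only the standard library and builds the same `resolve/main` URL as the `wget` command above. The helper names `resolve_url` and `download` are illustrative and not part of DeepTaxa.

```python
import shutil
import urllib.request
from pathlib import Path

REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(url: str, dest: Path) -> Path:
    """Stream a file to disk; skip the download if dest already exists."""
    if dest.exists():
        return dest
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated checkpoint under the final name.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

Calling `download(resolve_url(REPO_ID, FILENAME), Path(FILENAME))` saves the checkpoint to the current directory. If the `huggingface_hub` package is installed, its `hf_hub_download` function is an alternative that also handles caching.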
### Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
```bash
python -m deeptaxa.cli predict \
--fasta-file /path/to/sequences.fna.gz \
--checkpoint deeptaxa_april_2025.pt
```
Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
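Before running predictions, it can help to confirm that the input file parses and to check how many sequences it contains. The sketch below reads a plain or gzip-compressed FASTA file with the standard library. `read_fasta` and `summarize` are hypothetical helpers, not part of the DeepTaxa CLI.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a FASTA file, gzipped or plain."""
    opener = gzip.open if str(path).endswith(".gz") else open
    seq_id, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # Keep only the identifier, dropping any description after it.
                header = line[1:].split()
                seq_id, chunks = (header[0] if header else ""), []
            elif seq_id is not None:
                # Sequences may wrap across lines; join them and normalize case.
                chunks.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def summarize(path: str) -> dict:
    """Count records and report the shortest and longest sequence lengths."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    return {
        "records": len(lengths),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }
```

For example, `summarize("/path/to/sequences.fna.gz")` returns the record count and length range of the file passed to `--fasta-file`.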
## Training Details
- **Dataset**: 161,866 training and 40,467 validation sequences (roughly an 80/20 split) from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- **Hyperparameters**:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- **Training Time**: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
- **Timestamp**: Trained on 2025-04-04
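To make the loss settings concrete, the sketch below computes a focal loss per taxonomic level and combines the seven levels using the listed weights. How DeepTaxa actually combines the levels is not stated here, so the weighted sum and the domain-to-species ordering of the weights are assumptions, and the function names are illustrative. It is written in plain Python for readability; a real training loop would use tensor operations.

```python
import math
from typing import Sequence

GAMMA = 2.0
# Assumed order: domain, phylum, class, order, family, genus, species.
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int, gamma: float = GAMMA) -> float:
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    # Clamp to avoid log(0) when the target probability underflows.
    p_t = max(softmax(logits)[target], 1e-12)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def hierarchical_loss(level_logits, level_targets,
                      weights=LEVEL_WEIGHTS, gamma=GAMMA) -> float:
    """Weighted sum of per-level focal losses (assumed combination rule)."""
    return sum(
        w * focal_loss(logits, target, gamma)
        for w, logits, target in zip(weights, level_logits, level_targets)
    )
```

With `gamma = 0` the focal loss reduces to ordinary cross-entropy; larger values down-weight examples the model already classifies confidently. Under the weighted-sum assumption, a species-level error contributes five times as much to the total as an equally confident domain-level error.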
## Performance
Validation metrics (on 40,467 sequences):
| Level | Accuracy | Precision | Recall | F1-Score |
|----------|----------|-----------|--------|----------|
| Domain | 99.99% | 99.99% | 99.99% | 99.99% |
| Phylum | 99.92% | 99.92% | 99.92% | 99.92% |
| Class | 99.89% | 99.85% | 99.89% | 99.87% |
| Order | 99.72% | 99.64% | 99.72% | 99.67% |
| Family | 99.51% | 99.32% | 99.51% | 99.40% |
| Genus | 98.33% | 97.89% | 98.33% | 98.01% |
| Species | 95.29% | 94.34% | 95.29% | 94.56% |
- **Training Loss**: 0.283
- **Validation Loss**: 0.606
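The table reports recall equal to accuracy at every level, which is what support-weighted averaging produces, since weighted recall is mathematically identical to accuracy. Whether DeepTaxa uses weighted averaging is not stated, so the sketch below is an assumption showing how such per-level metrics could be computed from true and predicted labels. `weighted_metrics` is an illustrative name.

```python
from collections import Counter
from typing import Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> dict:
    """Accuracy plus support-weighted precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    total = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total         # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Running this once per taxonomic level, on that level's labels, would yield one table row per level.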
## Intended Use
- Taxonomic classification of 16S rRNA gene sequences in microbiome research and microbial ecology.
## Limitations
- GPU recommended (trained on NVIDIA A40).
- Species-level performance is lower than at higher ranks (95.29% accuracy, 94.34% precision), likely because of the large label space (10,547 classes).
## Citation
If you use this model in your research, please cite:
```bibtex
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```
## Contact
Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.
## Acknowledgements
- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.