Ahmed Moustafa
---
language: en
tags:
- bioinformatics
- microbiology
- microbiome
- taxonomy-classification
- deep-learning
- 16s-rrna
datasets:
- systems-genomics-lab/greengenes
metrics:
- accuracy
- precision
- recall
- f1
license: mit
model-index:
- name: DeepTaxa Hybrid CNN-BERT (April 2025)
  results:
  - task:
      type: classification
      name: Hierarchical Taxonomy Classification
    dataset:
      type: systems-genomics-lab/greengenes
      name: Greengenes (2024-09 Validation Split)
      split: validation
    metrics:
    - type: accuracy
      value: 0.9999258655200534
      name: Domain Accuracy
    - type: accuracy
      value: 0.9992339437072182
      name: Phylum Accuracy
    - type: accuracy
      value: 0.9988879828008006
      name: Class Accuracy
    - type: accuracy
      value: 0.9971581782687128
      name: Order Accuracy
    - type: accuracy
      value: 0.9950824128302074
      name: Family Accuracy
    - type: accuracy
      value: 0.9833444535053253
      name: Genus Accuracy
    - type: accuracy
      value: 0.9528751822472632
      name: Species Accuracy
---
# DeepTaxa: Hybrid CNN-BERT Model (April 2025)
**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, which combines convolutional neural networks (CNNs) with BERT to predict labels at seven taxonomic levels: domain, phylum, class, order, family, genus, and species. On the Greengenes validation split, accuracy ranges from 99.99% at the domain level to 95.29% at the species level.
## Model Details
- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
- **Training Data**: Greengenes dataset (2024-09 split)
- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- **Total Parameters**: 72,635,154
- **Max Sequence Length**: 512
- **Dropout Probability**: 0.2
- **License**: MIT
- **Version**: April 2025
- **File**: `deeptaxa_april_2025.pt`
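To make the size of the label space concrete, the per-level label counts listed above can be written down as a small lookup table. This is only an illustrative sketch: the counts are copied from this card, while the helper names and the idea of one output unit per label at each level are assumptions, not a description of the actual `HybridCNNBERTClassifier` internals.

```python
# Label counts per taxonomic level, copied from the "Levels Predicted" entry.
LABELS_PER_LEVEL = {
    "domain": 2,
    "phylum": 106,
    "class": 244,
    "order": 630,
    "family": 1353,
    "genus": 4798,
    "species": 10547,
}


def total_outputs(labels_per_level=LABELS_PER_LEVEL) -> int:
    """Total outputs, ASSUMING one output unit per label at each level."""
    return sum(labels_per_level.values())


def uniform_guess_accuracy(level: str) -> float:
    """Accuracy expected from guessing uniformly among a level's labels."""
    return 1.0 / LABELS_PER_LEVEL[level]


print(total_outputs())  # 17680
```

Under that assumption, the species level alone accounts for more than half of all outputs, which gives some intuition for why it is the hardest level to predict.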
## Usage
### Download the Model
To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:
- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- **Command Line (wget)**:
```bash
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
```
- **Command Line (git clone)**:
```bash
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
cd deeptaxa
# The model file is now in the current directory
```
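For scripted workflows, the same file can also be fetched from Python. The following is a minimal standard-library sketch that reuses the `resolve/main` URL pattern from the `wget` command above; `build_model_url` and `download_model` are hypothetical helper names and are not part of DeepTaxa.

```python
import shutil
import urllib.request
from pathlib import Path

REPO_ID = "systems-genomics-lab/deeptaxa"
MODEL_FILE = "deeptaxa_april_2025.pt"


def build_model_url(repo_id=REPO_ID, filename=MODEL_FILE, revision="main"):
    """Return the direct-download URL (same pattern as the wget command)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_model(dest_dir="."):
    """Download the checkpoint into dest_dir unless it is already there."""
    target = Path(dest_dir) / MODEL_FILE
    if target.exists():
        return target
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete checkpoint.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(build_model_url()) as response, \
            open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


# Example (downloads the full checkpoint on first call):
# checkpoint_path = download_model(".")
```

If the `huggingface_hub` package is installed, `hf_hub_download(repo_id="systems-genomics-lab/deeptaxa", filename="deeptaxa_april_2025.pt")` is an alternative that also handles caching.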
### Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
```bash
python -m deeptaxa.cli predict \
--fasta-file /path/to/sequences.fna.gz \
--checkpoint deeptaxa_april_2025.pt
```
Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
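When several FASTA files need to be classified, the CLI call above can be wrapped in a short Python script. The sketch below uses only the two flags shown in this card (`--fasta-file` and `--checkpoint`); the function names are hypothetical, and any other options or output locations should be taken from the GitHub documentation rather than from this example.

```python
import subprocess
import sys


def build_predict_command(fasta_file, checkpoint):
    """Assemble the `deeptaxa.cli predict` invocation as an argument list."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", str(fasta_file),
        "--checkpoint", str(checkpoint),
    ]


def run_predictions(fasta_files, checkpoint="deeptaxa_april_2025.pt"):
    """Run the CLI once per FASTA file and stop at the first failure."""
    for fasta_file in fasta_files:
        command = build_predict_command(fasta_file, checkpoint)
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


# Example (requires DeepTaxa to be installed in the current environment):
# run_predictions(["/path/to/sequences.fna.gz"])
```

Using `sys.executable` instead of a literal `python` keeps the subprocess in the same interpreter and virtual environment as the calling script.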
## Training Details
- **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- **Hyperparameters**:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- **Training Time**: ~21 minutes (1,254 seconds) on an NVIDIA A40 GPU
- **Timestamp**: Trained on 2025-04-04
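The focal-loss gamma and the level weights listed above can be illustrated with a small pure-Python sketch. The single-example focal loss follows the standard definition, `-(1 - p)^gamma * log(p)`, where `p` is the probability assigned to the true label. Combining the seven levels as a weighted sum is an assumption made for illustration; this card does not state how DeepTaxa reduces or combines the per-level losses, or whether a class-balancing term is used.

```python
import math

GAMMA = 2.0
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]


def focal_loss(p_true, gamma=GAMMA):
    """Focal loss for one example; p_true is the probability of the true label (0 < p_true <= 1)."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def combined_loss(p_true_per_level, weights=LEVEL_WEIGHTS, gamma=GAMMA):
    """ASSUMPTION: per-level focal losses are combined as a weighted sum."""
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# A confident correct prediction contributes far less than an uncertain one.
print(round(focal_loss(0.9), 6))  # 0.001054
print(round(focal_loss(0.5), 6))  # 0.173287
```

With gamma set to 0 the expression reduces to ordinary cross-entropy, so gamma = 2.0 down-weights examples the model already classifies confidently. The increasing level weights give errors at the genus and species levels more influence than errors at the domain level.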
## Performance
Validation metrics (on 40,467 sequences):
| Level | Accuracy | Precision | Recall | F1-Score |
|----------|----------|-----------|--------|----------|
| Domain | 99.99% | 99.99% | 99.99% | 99.99% |
| Phylum | 99.92% | 99.92% | 99.92% | 99.92% |
| Class | 99.89% | 99.85% | 99.89% | 99.87% |
| Order | 99.72% | 99.64% | 99.72% | 99.67% |
| Family | 99.51% | 99.32% | 99.51% | 99.40% |
| Genus | 98.33% | 97.89% | 98.33% | 98.01% |
| Species | 95.29% | 94.34% | 95.29% | 94.56% |
- **Training Loss**: 0.283
- **Validation Loss**: 0.606
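In the table above, Recall equals Accuracy at every level. That is what support-weighted averaging over classes produces, although the averaging mode is not stated in this card, so treat it as an inference. The sketch below computes accuracy and support-weighted precision and recall on a toy label set (the labels and function names are invented for illustration) to show why the two columns coincide under that assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall over the classes in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = sum(
        support[c] / n * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum(support[c] / n * (correct[c] / support[c]) for c in support)
    return precision, recall


# Toy example with three classes (not DeepTaxa output).
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "A"]

precision, recall = weighted_precision_recall(y_true, y_pred)
print(abs(recall - accuracy(y_true, y_pred)) < 1e-12)  # True
```

Each class's recall is weighted by its share of the true labels, so the weighted sum collapses to the number of correct predictions divided by the total, which is accuracy. Weighted precision has no such identity, which is consistent with the Precision column differing from Accuracy at the lower levels.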
## Intended Use
- Taxonomy classification in microbiome research and microbial ecology.
## Limitations
- A GPU is recommended (the model was trained on an NVIDIA A40).
- Species-level performance is lower than at higher ranks (95.29% accuracy, 94.34% precision), likely because of the large label space (10,547 classes).
## Citation
If you use this model in your research, please cite:
```bibtex
@software{DeepTaxa,
author = {{Systems Genomics Lab}},
title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
year = {2025},
publisher = {GitHub},
url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```
## Contact
Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.
## Acknowledgements
- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.