---
language: en
tags:
- bioinformatics
- microbiology
- microbiome
- taxonomy-classification
- deep-learning
- 16s-rrna
datasets:
- systems-genomics-lab/greengenes
metrics:
- accuracy
- precision
- recall
- f1
license: mit
model-index:
- name: DeepTaxa Hybrid CNN-BERT (April 2025)
  results:
  - task:
      type: classification
      name: Hierarchical Taxonomy Classification
    dataset:
      type: systems-genomics-lab/greengenes
      name: Greengenes (2024-09 Validation Split)
      split: validation
    metrics:
    - type: accuracy
      value: 0.9999258655200534
      name: Domain Accuracy
    - type: accuracy
      value: 0.9992339437072182
      name: Phylum Accuracy
    - type: accuracy
      value: 0.9988879828008006
      name: Class Accuracy
    - type: accuracy
      value: 0.9971581782687128
      name: Order Accuracy
    - type: accuracy
      value: 0.9950824128302074
      name: Family Accuracy
    - type: accuracy
      value: 0.9833444535053253
      name: Genus Accuracy
    - type: accuracy
      value: 0.9528751822472632
      name: Species Accuracy
---

# DeepTaxa: Hybrid CNN-BERT Model (April 2025)

**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

## Model Details
- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
- **Training Data**: Greengenes dataset (2024-09 split)
- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- **Total Parameters**: 72,635,154
- **Max Sequence Length**: 512
- **Dropout Probability**: 0.2
- **License**: MIT
- **Version**: April 2025
- **File**: `deeptaxa_april_2025.pt`

## Usage

### Download the Model
To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:

- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- **Command Line (wget)**:
  ```bash
  wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
  ```
- **Command Line (git clone)**:
  ```bash
  git clone https://huggingface.co/systems-genomics-lab/deeptaxa
  cd deeptaxa
  # The model file is now in the current directory
  ```

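For scripted setups, the same download can be done from Python. The following is a minimal sketch using only the standard library. It rebuilds the `resolve` URL used by the `wget` command above; the helper names `checkpoint_url` and `download_checkpoint` are illustrative and are not part of DeepTaxa.

```python
import shutil
import urllib.request
from pathlib import Path

# Values taken from this model card.
REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def checkpoint_url(repo_id: str = REPO_ID, filename: str = FILENAME,
                   revision: str = "main") -> str:
    """Build the same 'resolve' URL that the wget command uses."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(dest_dir: str = ".") -> Path:
    """Download the checkpoint into dest_dir unless it is already there."""
    dest = Path(dest_dir) / FILENAME
    if dest.exists():
        return dest  # avoid repeating the 871 MB transfer
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so that an interrupted download
    # is never mistaken for a complete checkpoint.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(checkpoint_url()) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

Calling `download_checkpoint("models")` would place the file at `models/deeptaxa_april_2025.pt` and return that path.
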
### Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
```bash
python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt
```

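To classify several FASTA files in one go, the command above can be wrapped in a short Python script. This is a sketch that uses only the two flags shown in this card (`--fasta-file` and `--checkpoint`). The function names are hypothetical, and it assumes the `deeptaxa` package is installed in the interpreter that runs the script.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def build_predict_command(fasta_file: str, checkpoint: str) -> List[str]:
    """Assemble the documented `predict` invocation as an argument list."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", str(fasta_file),
        "--checkpoint", str(checkpoint),
    ]


def predict_many(fasta_files: Iterable[str],
                 checkpoint: str = "deeptaxa_april_2025.pt") -> None:
    """Run the CLI once per FASTA file and stop at the first failure."""
    if not Path(checkpoint).is_file():
        raise FileNotFoundError(f"checkpoint not found: {checkpoint}")
    for fasta in fasta_files:
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_predict_command(fasta, checkpoint), check=True)
```

Using `sys.executable` in place of a bare `python` keeps the subprocess in the same environment as the calling script.
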
Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).

## Training Details
- **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- **Hyperparameters**:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- **Training Time**: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
- **Timestamp**: Trained on 2025-04-04

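To illustrate how the focal loss gamma and the level weights listed above could interact, here is a small sketch in plain Python. It uses the standard focal loss for a single example, `-(1 - p_t)^gamma * log(p_t)`, where `p_t` is the probability assigned to the true label. Combining the seven levels as a weighted sum is an assumption; this card does not state how DeepTaxa aggregates the per-level losses.

```python
import math
from typing import Sequence

# Values taken from the hyperparameter list in this card.
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
GAMMA = 2.0


def focal_loss(p_true: float, gamma: float = GAMMA) -> float:
    """Focal loss for one example, given the probability of the true label."""
    p_true = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def hierarchical_loss(p_true_per_level: Sequence[float],
                      weights: Sequence[float] = LEVEL_WEIGHTS,
                      gamma: float = GAMMA) -> float:
    """Weighted sum of per-level focal losses (assumed combination rule)."""
    if len(p_true_per_level) != len(weights):
        raise ValueError("expected one probability per taxonomic level")
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))
```

With `gamma=0` the focal term reduces to ordinary cross-entropy. With `gamma=2.0`, confidently correct predictions are down-weighted, so hard examples dominate the loss. The increasing level weights make species-level errors count five times as much as domain-level errors under this sketch.
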
## Performance
Validation metrics (on 40,467 sequences):

| Level    | Accuracy | Precision | Recall | F1-Score |
|----------|----------|-----------|--------|----------|
| Domain   | 99.99%   | 99.99%    | 99.99% | 99.99%   |
| Phylum   | 99.92%   | 99.92%    | 99.92% | 99.92%   |
| Class    | 99.89%   | 99.85%    | 99.89% | 99.87%   |
| Order    | 99.72%   | 99.64%    | 99.72% | 99.67%   |
| Family   | 99.51%   | 99.32%    | 99.51% | 99.40%   |
| Genus    | 98.33%   | 97.89%    | 98.33% | 98.01%   |
| Species  | 95.29%   | 94.34%    | 95.29% | 94.56%   |

- **Training Loss**: 0.283
- **Validation Loss**: 0.606

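In the table above, recall equals accuracy at every level. That is what support-weighted averaging of per-class recall produces, since it always reduces to overall accuracy. The averaging mode is not stated in this card, so the sketch below is an assumption about how such numbers could be computed, not a description of the DeepTaxa evaluation code. The function name `weighted_metrics` is illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1 for one level."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per label
    predicted = Counter(y_pred)    # number of predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Running this once per taxonomic level on the true and predicted label lists would give one table row per level.
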
## Intended Use
- Taxonomy classification in microbiome research and microbial ecology.

## Limitations
- A GPU is recommended; the model was trained on an NVIDIA A40.
- Lower precision at the species level (94.34%, versus 97.89% at genus) due to the large label space (10,547 classes).

## Citation
If you use this model in your research, please cite:
```bibtex
@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```

## Contact
Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.

## Acknowledgements
- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.