---
language: en
tags:
- bioinformatics
- microbiology
- microbiome
- taxonomy-classification
- deep-learning
- 16s-rrna
datasets:
- systems-genomics-lab/greengenes
metrics:
- accuracy
- precision
- recall
- f1
license: mit
model-index:
- name: DeepTaxa Hybrid CNN-BERT (April 2025)
  results:
  - task:
      type: classification
      name: Hierarchical Taxonomy Classification
    dataset:
      type: systems-genomics-lab/greengenes
      name: Greengenes (2024-09 Validation Split)
      split: validation
    metrics:
    - type: accuracy
      value: 0.9999258655200534
      name: Domain Accuracy
    - type: accuracy
      value: 0.9992339437072182
      name: Phylum Accuracy
    - type: accuracy
      value: 0.9988879828008006
      name: Class Accuracy
    - type: accuracy
      value: 0.9971581782687128
      name: Order Accuracy
    - type: accuracy
      value: 0.9950824128302074
      name: Family Accuracy
    - type: accuracy
      value: 0.9833444535053253
      name: Genus Accuracy
    - type: accuracy
      value: 0.9528751822472632
      name: Species Accuracy
---

# DeepTaxa: Hybrid CNN-BERT Model (April 2025)

**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
## Model Details

- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
- **Training Data**: Greengenes dataset (2024-09 split)
- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- **Total Parameters**: 72,635,154
- **Max Sequence Length**: 512
- **Dropout Probability**: 0.2
- **License**: MIT
- **Version**: April 2025
- **File**: `deeptaxa_april_2025.pt`

## Usage

### Download the Model

To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:

- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- **Command Line (wget)**:
  ```bash
  wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
  ```
- **Command Line (git clone)**:
  ```bash
  git clone https://huggingface.co/systems-genomics-lab/deeptaxa
  cd deeptaxa
  # The model file is now in the current directory
  ```

### Run Predictions

Once downloaded, use the model with the DeepTaxa CLI:

```bash
python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt
```

Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
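
For scripted setups where `wget` is not available, the download step can also be sketched in Python using only the standard library. This is a minimal sketch, not part of DeepTaxa: the helper names `checkpoint_url` and `download_checkpoint` are hypothetical, and the URL is assumed to follow the same `resolve/main` pattern as the `wget` command above.

```python
import shutil
import urllib.request
from pathlib import Path

# Repository and file name copied from the download instructions.
REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def checkpoint_url(repo_id: str = REPO_ID, filename: str = FILENAME,
                   revision: str = "main") -> str:
    """Build the 'resolve' URL used by the wget command (hypothetical helper)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(dest_dir: str = ".") -> Path:
    """Download the checkpoint into dest_dir unless it is already present.

    Hypothetical helper: it streams the file to a temporary '.part' path
    and renames it only after the transfer finishes, so an interrupted
    download is never mistaken for a complete checkpoint.
    """
    dest = Path(dest_dir) / FILENAME
    if dest.exists():
        return dest  # reuse the existing copy and skip the 871 MB download
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(checkpoint_url()) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all in memory
    tmp.replace(dest)
    return dest
```

Calling `download_checkpoint("models")` would return the local path, which can then be passed to the CLI's `--checkpoint` option.
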
## Training Details

- **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- **Hyperparameters**:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- **Training Time**: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
- **Timestamp**: Trained on 2025-04-04

## Performance

Validation metrics (on 40,467 sequences):

| Level   | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Domain  | 99.99%   | 99.99%    | 99.99% | 99.99%   |
| Phylum  | 99.92%   | 99.92%    | 99.92% | 99.92%   |
| Class   | 99.89%   | 99.85%    | 99.89% | 99.87%   |
| Order   | 99.72%   | 99.64%    | 99.72% | 99.67%   |
| Family  | 99.51%   | 99.32%    | 99.51% | 99.40%   |
| Genus   | 98.33%   | 97.89%    | 98.33% | 98.01%   |
| Species | 95.29%   | 94.34%    | 95.29% | 94.56%   |

- **Training Loss**: 0.283
- **Validation Loss**: 0.606

## Intended Use

- Taxonomy classification in microbiome research and microbial ecology.

## Limitations

- GPU recommended (trained on NVIDIA A40).
- Lower precision at species level due to label complexity (10,547 classes).

## Citation

If you use this model in your research, please cite:

```bibtex
@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```

## Contact

Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.

## Acknowledgements

- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.