---

language: en
tags:
  - bioinformatics
  - microbiology
  - microbiome
  - taxonomy-classification
  - deep-learning
  - 16s-rrna
datasets:
  - systems-genomics-lab/greengenes
metrics:
  - accuracy
  - precision
  - recall
  - f1
license: mit
model-index:
  - name: DeepTaxa Hybrid CNN-BERT (April 2025)
    results:
      - task:
          type: classification
          name: Hierarchical Taxonomy Classification
        dataset:
          type: systems-genomics-lab/greengenes
          name: Greengenes (2024-09 Validation Split)
          split: validation
        metrics:
          - type: accuracy
            value: 0.9999258655200534
            name: Domain Accuracy
          - type: accuracy
            value: 0.9992339437072182
            name: Phylum Accuracy
          - type: accuracy
            value: 0.9988879828008006
            name: Class Accuracy
          - type: accuracy
            value: 0.9971581782687128
            name: Order Accuracy
          - type: accuracy
            value: 0.9950824128302074
            name: Family Accuracy
          - type: accuracy
            value: 0.9833444535053253
            name: Genus Accuracy
          - type: accuracy
            value: 0.9528751822472632
            name: Species Accuracy
---


# DeepTaxa: Hybrid CNN-BERT Model (April 2025)

**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.

## Model Details
- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
- **Training Data**: Greengenes dataset (2024-09 split)
- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
- **Total Parameters**: 72,635,154
- **Max Sequence Length**: 512
- **Dropout Probability**: 0.2
- **License**: MIT
- **Version**: April 2025
- **File**: `deeptaxa_april_2025.pt`

## Usage

### Download the Model
To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:

- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
- **Command Line (wget)**:
  ```bash
  wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
  ```
- **Command Line (git clone)**:
  ```bash
  git clone https://huggingface.co/systems-genomics-lab/deeptaxa
  cd deeptaxa
  # The model file is now in the current directory
  ```
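
For scripted pipelines, the same direct-download URL can be fetched from Python. The following is a minimal sketch using only the standard library; the helper names `resolve_url` and `download_checkpoint` are illustrative and not part of DeepTaxa, and the URL pattern is simply the one used by the `wget` command above.

```python
import urllib.request
from pathlib import Path

# Repository and file name as given in this model card.
REPO_ID = "systems-genomics-lab/deeptaxa"
FILENAME = "deeptaxa_april_2025.pt"


def resolve_url(repo_id: str = REPO_ID, filename: str = FILENAME,
                revision: str = "main") -> str:
    """Build the direct-download URL (same pattern as the wget command)."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(dest_dir: str = ".") -> Path:
    """Download the checkpoint into dest_dir unless it is already there.

    Hypothetical helper: it only checks that a file with the expected
    name exists and does not verify its size or integrity.
    """
    dest = Path(dest_dir) / FILENAME
    if not dest.exists():
        urllib.request.urlretrieve(resolve_url(), str(dest))
    return dest


# Usage (downloads roughly 871 MB):
#     checkpoint_path = download_checkpoint(".")
```

Skipping the download when the file already exists keeps repeated pipeline runs cheap, at the cost of not detecting a partial or outdated file.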

### Run Predictions
Once downloaded, use the model with the DeepTaxa CLI:
```bash
python -m deeptaxa.cli predict \
  --fasta-file /path/to/sequences.fna.gz \
  --checkpoint deeptaxa_april_2025.pt
```

Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
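
When predictions are launched from another Python program, the CLI call above can be wrapped with `subprocess`. This is a rough sketch, assuming DeepTaxa is installed in the same environment as the calling interpreter; `build_predict_command` and `run_predict` are hypothetical helpers, and only the two flags shown above are used.

```python
import subprocess
import sys
from typing import List


def build_predict_command(fasta_file: str, checkpoint: str) -> List[str]:
    """Assemble the argument list for the predict command shown above."""
    return [
        sys.executable, "-m", "deeptaxa.cli", "predict",
        "--fasta-file", fasta_file,
        "--checkpoint", checkpoint,
    ]


def run_predict(fasta_file: str, checkpoint: str) -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_predict_command(fasta_file, checkpoint), check=True)


# Usage:
#     run_predict("/path/to/sequences.fna.gz", "deeptaxa_april_2025.pt")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.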

## Training Details
- **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
- **Hyperparameters**:
  - Learning Rate: 0.0001
  - Batch Size: 16
  - Epochs: 10
  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
  - Focal Loss Gamma: 2.0
  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
- **Training Time**: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
- **Timestamp**: Trained on 2025-04-04
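
To illustrate how the focal loss gamma and the level weights listed above might interact, here is a minimal sketch in plain Python. It assumes the per-level focal losses are combined as a weighted sum, in the order domain to species, with no class-balancing alpha term; the card does not state the exact formulation, so this is not DeepTaxa's actual training code.

```python
import math
from typing import Sequence

GAMMA = 2.0                                          # Focal Loss Gamma from the list above
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]  # assumed order: domain ... species


def focal_loss(p_true: float, gamma: float = GAMMA) -> float:
    """Focal loss for one example: -(1 - p)^gamma * log(p).

    p_true is the predicted probability of the correct label.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def hierarchical_loss(p_true_per_level: Sequence[float],
                      weights: Sequence[float] = LEVEL_WEIGHTS,
                      gamma: float = GAMMA) -> float:
    """Weighted sum of per-level focal losses (assumed combination rule)."""
    if len(p_true_per_level) != len(weights):
        raise ValueError("expected one probability per taxonomic level")
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# A confident domain call contributes little; an uncertain species call dominates.
example = hierarchical_loss([0.99, 0.98, 0.97, 0.95, 0.9, 0.8, 0.5])
```

Under this reading, the increasing weights push the optimizer toward the fine-grained levels, while the `(1 - p)^gamma` factor down-weights examples the model already classifies confidently.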



## Performance

Validation metrics (on 40,467 sequences):

| Level   | Accuracy | Precision | Recall | F1-Score |
|---------|----------|-----------|--------|----------|
| Domain  | 99.99%   | 99.99%    | 99.99% | 99.99%   |
| Phylum  | 99.92%   | 99.92%    | 99.92% | 99.92%   |
| Class   | 99.89%   | 99.85%    | 99.89% | 99.87%   |
| Order   | 99.72%   | 99.64%    | 99.72% | 99.67%   |
| Family  | 99.51%   | 99.32%    | 99.51% | 99.40%   |
| Genus   | 98.33%   | 97.89%    | 98.33% | 98.01%   |
| Species | 95.29%   | 94.34%    | 95.29% | 94.56%   |

- **Training Loss**: 0.283
- **Validation Loss**: 0.606
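
The card does not state how precision, recall, and F1 are averaged across classes. Accuracy equals recall in every row of the table, which is what support-weighted averaging produces, so the sketch below assumes that scheme. The function `weighted_metrics` and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence, Tuple


def weighted_metrics(y_true: Sequence[str],
                     y_pred: Sequence[str]) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, f1) with support-weighted averaging."""
    n = len(y_true)
    support = Counter(y_true)        # true examples per class
    predicted = Counter(y_pred)      # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n           # each class weighted by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / n
    return accuracy, precision, recall, f1


# Toy example with three classes.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
```

With this averaging, weighted recall always equals accuracy, and the weighted F1 is an average of per-class F1 scores rather than the harmonic mean of the weighted precision and recall, so the F1 column need not match a value recomputed from the other two columns.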





## Intended Use

- Taxonomy classification in microbiome research and microbial ecology.



## Limitations

- A GPU is recommended; the model was trained on an NVIDIA A40.
- Performance is lowest at the species level (95.29% accuracy, 94.34% precision), reflecting the large label space (10,547 classes).



## Citation

If you use this model in your research, please cite:

```bibtex
@software{DeepTaxa,
  author = {{Systems Genomics Lab}},
  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/systems-genomics-lab/deeptaxa},
}
```



## Contact

Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.



## Acknowledgements

- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.

- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.

- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.