Update README.md
README.md
CHANGED
|
@@ -4,18 +4,17 @@ license: cc-by-nc-sa-4.0
|
|
| 4 |
# ProkBERT PhaStyle
|
| 5 |
|
| 6 |
**Model Name**: neuralbioinfo/PhaStyle-mini
|
| 7 |
-
**Model Type**: Genomic Language Model (
|
| 8 |
**Model Description**:
|
| 9 |
|
| 10 |
-
ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on
|
| 11 |
-
|
| 12 |
By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
|
| 13 |
|
| 14 |
### Key Points:
|
| 15 |
- **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
|
| 16 |
- **Segment Length** for training: 512 base pairs.
|
| 17 |
- **Output**: Binary classification (virulent or temperate).
|
| 18 |
-
- **Model Parameters**: ~
|
| 19 |
|
| 20 |
---
|
| 21 |
|
|
@@ -44,3 +43,29 @@ python bin/PhaStyle.py \
|
|
| 44 |
--per_device_eval_batch_size 196
|
| 45 |
|
| 46 |
```
|
| 4 |
# ProkBERT PhaStyle
|
| 5 |
|
| 6 |
**Model Name**: neuralbioinfo/PhaStyle-mini
|
| 7 |
+
**Model Type**: Genomic Language Model (ProkBERT-based)
|
| 8 |
**Model Description**:
|
| 9 |
|
| 10 |
+
ProkBERT PhaStyle is a fine-tuned genomic language model designed for phage lifestyle prediction. It classifies phages as either **virulent** or **temperate** directly from nucleotide sequences. The model is based on the ProkBERT architecture and was trained on the **BACPHLIP dataset**, excluding *E. coli* sequences.
|
|
|
|
| 11 |
By leveraging transfer learning, ProkBERT PhaStyle is optimized for handling **fragmented sequences**, commonly encountered in metagenomic and metavirome datasets. The model provides a fast, efficient alternative to traditional methods without requiring complex preprocessing pipelines or curated databases.
|
| 12 |
|
| 13 |
### Key Points:
|
| 14 |
- **Trained on BACPHLIP** dataset excluding *E. coli* sequences.
|
| 15 |
- **Segment Length** for training: 512 base pairs.
|
| 16 |
- **Output**: Binary classification (virulent or temperate).
|
| 17 |
+
- **Model Parameters**: ~25 million parameters.
|
| 18 |
|
| 19 |
---
|
| 20 |
|
|
|
|
| 43 |
--per_device_eval_batch_size 196
|
| 44 |
|
| 45 |
```
|
| 46 |
+
|
| 47 |
+
### Datasets Used:
|
| 48 |
+
|
| 49 |
+
- **BACPHLIP (without *E. coli*)**: 1,868 training sequences and 246 validation sequences.
|
| 50 |
+
- **Guelin Collection**: 394 *Escherichia* phages (temperate and virulent types).
|
| 51 |
+
- **EXTREMOPHILE Phages**: 16 phages isolated from extreme environments, including deep-sea, acidic, and arsenic-rich habitats.
|
| 52 |
+
|
| 53 |
+
Each dataset was split into **512 bp segments** to simulate fragmented metagenomic assemblies.
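
The README does not spell out the exact segmentation procedure, so the following is only a minimal sketch of how a genome could be cut into fixed-length fragments. The function name `segment_sequence`, the choice of non-overlapping windows, and dropping the short trailing fragment are all assumptions made for illustration, not the authors' confirmed method.

```python
def segment_sequence(sequence, segment_length=512):
    """Split a nucleotide sequence into non-overlapping fixed-length segments.

    Hypothetical helper: the windowing strategy (no overlap, trailing
    remainder discarded) is an assumption, not taken from the PhaStyle code.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    sequence = sequence.upper()
    # Stop early enough that every window is exactly segment_length long.
    last_start = len(sequence) - segment_length
    return [
        sequence[start:start + segment_length]
        for start in range(0, last_start + 1, segment_length)
    ]


# Toy 1,200 bp sequence: yields 2 segments of 512 bp; the 176 bp tail is dropped.
genome = "ACGT" * 300
segments = segment_sequence(genome)
print(len(segments), [len(s) for s in segments])
```

Overlapping windows or keeping the short tail are equally plausible choices; which one matches the published evaluation would need to be checked against the PhaStyle repository.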
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
+
|
| 57 |
+
## Performance
|
| 58 |
+
|
| 59 |
+
ProkBERT PhaStyle outperforms state-of-the-art models, especially in generalization and speed. It has been benchmarked on **short fragments** (512 bp) and **phages from unseen environments**, demonstrating its robustness for both environmental and clinical datasets.
|
| 60 |
+
|
| 61 |
+
### Key Metrics:
|
| 62 |
+
- **Balanced Accuracy**: 0.94 (on 1022 bp fragments from the *Escherichia* dataset)
|
| 63 |
+
- **MCC (Matthews Correlation Coefficient)**: 0.91
|
| 64 |
+
- **Sensitivity**: 0.97
|
| 65 |
+
- **Specificity**: 0.91
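
For readers who want to reproduce this kind of summary on their own predictions, here is a small sketch of how the four metrics relate to confusion-matrix counts. Balanced accuracy is the mean of sensitivity and specificity, which is consistent with the values listed: (0.97 + 0.91) / 2 = 0.94. The function name `binary_metrics` and the example counts are made up for illustration, and which lifestyle is treated as the positive class is an assumption, since the README does not say.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, balanced accuracy and MCC.

    Hypothetical helper. Assumes both classes are present in the
    reference labels, so tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Illustrative counts only (not the model's actual confusion matrix).
metrics = binary_metrics(tp=97, fp=9, tn=91, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike balanced accuracy, MCC depends on the class balance, so it cannot be recovered from sensitivity and specificity alone.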
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
## Limitations
|
| 70 |
+
|
| 71 |
+
ProkBERT PhaStyle is designed specifically for **binary classification** of phage lifestyles (virulent vs. temperate) and does not handle non-phage sequences. Use it downstream of pipelines that first identify phage sequences. For large-scale inference, **GPU support** is strongly advised.
|