---
license: mit
tags:
- biology
- protein-classification
- microalgae
- genomics
- nanoGPT
datasets:
- custom
pipeline_tag: text-classification
---

# Model Card: algaGPT

## Overview

**Name:** algaGPT

**Type:** Causal language model for protein sequence classification

**Base:** nanoGPT (Andrej Karpathy)

**Task:** Binary classification of microalgal vs. contaminant protein sequences

**Mode:** TI-inclusive (full-length sequences)

## Training

- **Data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
- **Algal sources:** 166 microalgal genomes across 10 phyla
- **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr

## Performance

| Metric | Score |
|--------|-------|
| Recall | >99% |
| Speed vs BLASTP+ | ~10,701× faster |
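
For reference, recall is the fraction of sequences truly belonging to the positive class that the model labels correctly. The card does not state which class was treated as positive or how the score was computed, so the sketch below is only an illustration: it assumes `algal` is the positive class, and the labels are made-up toy values, not evaluation data.

```python
def recall(y_true, y_pred, positive="algal"):
    """Recall = TP / (TP + FN) for the chosen positive class."""
    # True positives: positive examples the classifier also called positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    # False negatives: positive examples the classifier missed.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


# Toy labels for illustration only (assumption: "algal" is the positive class).
y_true = ["algal", "algal", "conta", "algal", "conta"]
y_pred = ["algal", "conta", "conta", "algal", "algal"]
print(recall(y_true, y_pred))  # 2 of the 3 true algal sequences are recovered
```
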

## Usage

Input a protein sequence; the model generates a classification tag, `algal` or `conta` (contaminant), via next-token prediction.
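
The prompt-then-read-the-tag flow described above could be wrapped roughly as follows. This is a minimal sketch, not the released inference code: the `generate` callable, the space separator between sequence and tag, and the uppercase normalization are all assumptions made for illustration, and `dummy_generate` is a stand-in so the example runs without model weights.

```python
from typing import Callable

TAGS = ("algal", "conta")  # tag strings as named in this card


def classify(sequence: str, generate: Callable[[str, int], str]) -> str:
    """Prompt a causal LM with a protein sequence and read back the tag.

    `generate(prompt, max_new_chars)` is a hypothetical wrapper around the
    model's sampling loop that returns only the newly generated text.
    """
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    if not seq:
        raise ValueError("empty sequence")
    prompt = seq + " "  # assumed separator before the tag; check the real format
    completion = generate(prompt, max(len(t) for t in TAGS))
    for tag in TAGS:
        if completion.strip().startswith(tag):
            return tag
    return "unknown"  # the model emitted something other than a known tag


def dummy_generate(prompt: str, max_new_chars: int) -> str:
    # Stand-in for the real model so the sketch is runnable; always says "algal".
    return "algal"[:max_new_chars]


print(classify("MKTAYIAKQR\nQISFVKSHFS", dummy_generate))  # prints "algal"
```

Passing the sampler in as a callable keeps the parsing logic independent of how the checkpoint is loaded, so the same function works with whatever sampling loop the checkpoint actually uses.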

## Citation

Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. *Patterns*. 2024;6(11).

## Contact

Kourosh Salehi-Ashtiani

ksa3@nyu.edu