metadata
license: mit
tags:
  - biology
  - protein-classification
  - microalgae
  - genomics
  - nanoGPT
datasets:
  - custom
pipeline_tag: text-classification

Model Card: algaGPT

Overview

  • Name: algaGPT
  • Type: Causal language model for protein sequence classification
  • Base: nanoGPT (Andrej Karpathy)
  • Task: Binary classification of microalgal vs. contaminant protein sequences
  • Mode: TI-inclusive (full-length sequences)

Training

  • Data: ~58.6M protein sequences (1:1 algal:contaminant ratio)
  • Algal sources: 166 microalgal genomes across 10 phyla
  • Contaminant sources: Bacterial, archaeal, and fungal sequences from NCBI nr

Performance

| Metric | Score |
| --- | --- |
| Recall | >99% |
| Speed vs. BLASTP+ | ~10,701× faster |
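
For readers who want to reproduce these two figures on their own data, the sketch below shows the standard definitions of recall and of a speed-up ratio. This card does not state which class was treated as positive or how the timings were measured, so the function names and the numbers in the usage note are illustrative assumptions, not the authors' evaluation code.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positives that the classifier recovered."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no positive examples")
    return true_positives / total


def speedup(baseline_seconds: float, model_seconds: float) -> float:
    """How many times faster the model is than the baseline on the same input."""
    if model_seconds <= 0:
        raise ValueError("model_seconds must be positive")
    return baseline_seconds / model_seconds
```

With made-up counts, `recall(995, 5)` gives 0.995, and `speedup(120.0, 0.5)` gives 240.0.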

Usage

Input a protein sequence; the model generates a classification tag, either "algal" or "conta" (contaminant), via next-token prediction.
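
A minimal sketch of that workflow follows, assuming the model is wrapped in a callable that takes a prompt string and returns the generated text. The `generate` callable, the `classify` helper, and the prompt format (the bare uppercase sequence) are hypothetical. This card does not document the checkpoint's loading code, tokenizer, or prompt delimiters, so adapt those parts to the released files.

```python
from typing import Callable

# Tag strings as named in this card; "conta" stands for contaminant.
ALGAL_TAG = "algal"
CONTAMINANT_TAG = "conta"


def classify(sequence: str, generate: Callable[[str], str]) -> str:
    """Return 'algal', 'contaminant', or 'unknown' for one protein sequence.

    `generate` is a hypothetical wrapper around the model's sampling loop:
    it receives the prompt and returns only the newly generated text.
    """
    # Assumption: the prompt is just the cleaned amino-acid sequence.
    prompt = sequence.strip().upper()
    completion = generate(prompt)

    # Locate each tag in the completion and keep the one that appears first.
    positions = {tag: completion.find(tag) for tag in (ALGAL_TAG, CONTAMINANT_TAG)}
    found = {tag: pos for tag, pos in positions.items() if pos >= 0}
    if not found:
        return "unknown"
    first_tag = min(found, key=found.get)
    return "algal" if first_tag == ALGAL_TAG else "contaminant"


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without model weights.
    def fake_generate(prompt: str) -> str:
        return " conta"

    print(classify("mktayiakqr", fake_generate))
```

Returning "unknown" when neither tag appears keeps malformed generations from being silently counted as one class.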

Citation

Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. Patterns. 2024;6(11).

Contact

Kourosh Salehi-Ashtiani ksa3@nyu.edu