---
license: mit
tags:
- biology
- protein-classification
- microalgae
- genomics
- nanoGPT
datasets:
- custom
pipeline_tag: text-classification
---

# Model Card: algaGPT

## Overview

- **Name:** algaGPT
- **Type:** Causal language model for protein sequence classification
- **Base:** nanoGPT (Andrej Karpathy)
- **Task:** Binary classification of microalgal vs. contaminant protein sequences
- **Mode:** TI-inclusive (full-length sequences)

## Training

- **Data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
- **Algal sources:** 166 microalgal genomes across 10 phyla
- **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr

## Performance

| Metric | Score |
|--------|-------|
| Recall | >99% |
| Speed vs. BLASTP+ | ~10,701× faster |

## Usage

Input a protein sequence. The model generates a classification tag, `algal` or `conta` (contaminant), via next-token prediction.

## Citation

Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. *Patterns*. 2024;6(11).

## Contact

Kourosh Salehi-Ashtiani (ksa3@nyu.edu)
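
## Example: Parsing the Classification Tag

The Usage section says the model emits a classification tag (`algal` or `conta`) by next-token prediction. The following is a minimal sketch of how a caller might turn that generated text into a label. It does not load or run algaGPT. The function name `parse_classification`, the `TAG_TO_LABEL` mapping, and the assumption that the tag appears at the start of the generated continuation are all hypothetical and not taken from the model's actual prompt format.

```python
from typing import Optional

# The two tag strings come from the Usage section of this card.
# The mapping and the parsing rule below are illustrative assumptions.
TAG_TO_LABEL = {"algal": "algal", "conta": "contaminant"}


def parse_classification(continuation: str) -> Optional[str]:
    """Map the model's generated continuation to a label.

    `continuation` is assumed to hold only the newly generated text,
    not the input protein sequence. Returns None if no known tag
    appears at the start of the continuation.
    """
    text = continuation.strip()
    for tag, label in TAG_TO_LABEL.items():
        if text.startswith(tag):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical continuations, for illustration only.
    print(parse_classification("algal"))
    print(parse_classification(" conta\n"))
    print(parse_classification("MKV"))
```

Only the continuation is inspected, and the match is case-sensitive and anchored at the start. A, L, and G are all valid amino-acid letters, so a looser search over the full prompt plus output could match residues in the sequence itself.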