Commit 8cb4404 by GreenGenomicsLab (verified) · 1 parent: bf27627

Create README.md

Files changed (1): README.md (added, +41 -0)

# Model Card: algaGPT

## Overview

- **Name:** algaGPT
- **Type:** Causal language model for protein sequence classification
- **Base:** nanoGPT (Andrej Karpathy)
- **Task:** Binary classification of microalgal vs. contaminant protein sequences
- **Mode:** TI-inclusive (full-length sequences)

## Training

- **Data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
- **Algal sources:** 166 microalgal genomes across 10 phyla
- **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr

## Performance

| Metric | Score |
|--------|-------|
| Recall | >99% |
| Speed vs. BLASTP+ | >10,701x faster |

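
For readers less familiar with the metrics, the sketch below shows how recall and the related false positive rate are computed from confusion-matrix counts. It assumes "algal" is the positive class, which this card does not state explicitly, and the counts are hypothetical values chosen for illustration, not results from the paper.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of truly algal sequences that were labeled algal."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of contaminant sequences that were wrongly labeled algal."""
    return fp / (fp + tn)


# Hypothetical counts for illustration only (assumption, not reported data).
tp, fn = 995, 5    # algal sequences: correctly vs. incorrectly labeled
fp, tn = 20, 980   # contaminant sequences: wrongly vs. correctly labeled

print(f"recall = {recall(tp, fn):.3f}")
print(f"false positive rate = {false_positive_rate(fp, tn):.3f}")
```

With these toy counts, 995 of 1,000 algal sequences are recovered and 20 of 1,000 contaminant sequences are mislabeled as algal.
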
## Usage

Provide a protein sequence as input; the model generates a classification tag (algal or bacterial) via next-token prediction.

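
As a rough illustration of classification by next-token prediction, the sketch below picks the more likely of two tag tokens from a set of next-token scores. The tag token names, the `classify_from_logits` helper, and the toy logits are all hypothetical; the card does not specify the actual vocabulary, prompt format, or inference code, so a real forward pass through the model would replace the toy dictionary.

```python
import math

# Hypothetical tag tokens (assumption: the real vocabulary is not given here).
TAGS = ("algal", "bacterial")


def classify_from_logits(next_token_logits: dict[str, float]) -> tuple[str, float]:
    """Return the higher-scoring tag and its probability,
    renormalized over the two tag tokens only (softmax over TAGS)."""
    scores = [next_token_logits[t] for t in TAGS]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(TAGS)), key=probs.__getitem__)
    return TAGS[best], probs[best]


# Toy scores standing in for the model's output after reading a sequence.
toy_logits = {"algal": 2.0, "bacterial": 0.0, "A": -1.0}
tag, prob = classify_from_logits(toy_logits)
print(tag, round(prob, 3))
```

Restricting the softmax to the tag tokens is one possible design choice: it ignores probability mass the model places on non-tag tokens such as amino-acid letters.
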
## Limitations

- Under-representation of some algal lineages (dinoflagellates, rhodophytes, Chromerida)
- Up to ~10% false positive rate for species with complex endosymbiotic histories
- Does not classify eukaryotic protist contaminants

## Citation

Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. *Patterns*. 2024;6(11).

## Contact

Kourosh Salehi-Ashtiani
ksa3@nyu.edu