GreenGenomicsLab committed on
Commit 3eca795 · verified · 1 Parent(s): e02ea3d

Update model card: full Patterns author list, ELF-NET application note

Files changed (1)
  1. README.md +39 -22
README.md CHANGED
@@ -1,29 +1,26 @@
  ---
  license: mit
  tags:
- - biology
- - protein-classification
- - microalgae
- - genomics
- - nanoGPT
- datasets:
- - custom
  pipeline_tag: text-classification
  ---
 
- # Model Card: algaGPT
 
- ## Overview
 
- **Name:** algaGPT
- **Type:** Causal language model for protein sequence classification
- **Base:** nanoGPT (Andrej Karpathy)
- **Task:** Binary classification of microalgal vs. contaminant protein sequences
- **Mode:** TI-inclusive (full-length sequences)
 
- ## Training
-
- - **Data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
  - **Algal sources:** 166 microalgal genomes across 10 phyla
  - **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr
 
@@ -32,18 +29,38 @@ pipeline_tag: text-classification
  | Metric | Score |
  |--------|-------|
  | Recall | >99% |
- | Speed vs BLASTP+ | ~10,701× faster |
 
  ## Usage
 
- Input a protein sequence; model generates a classification tag (algal/conta (contaminant)) via next-token prediction.
 
 
  ## Citation
 
- Nelson DR, Jaiswal AK, Ismail NS, Mystikou A, Salehi-Ashtiani K. Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras. *Patterns*. 2024;6(11).
 
  ## Contact
 
- Kourosh Salehi-Ashtiani
- ksa3@nyu.edu
 
  ---
  license: mit
  tags:
+ - biology
+ - protein-classification
+ - microalgae
+ - genomics
+ - nanoGPT
+ - metagenomics
+ - tara-oceans
  pipeline_tag: text-classification
  ---
 
+ # algaGPT
 
+ Causal language model for binary classification of microalgal vs. contaminant protein sequences.
 
+ ## Model Description
 
+ - **Architecture:** nanoGPT (Andrej Karpathy)
+ - **Task:** Binary classification of microalgal protein sequences via next-token prediction
+ - **Mode:** TI-inclusive (full-length sequences)
+ - **Training data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
  - **Algal sources:** 166 microalgal genomes across 10 phyla
  - **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr
 
 
  | Metric | Score |
  |--------|-------|
  | Recall | >99% |
+ | Speed vs. BLASTp | ~10,701x faster |
 
  ## Usage
 
+ **Input:** Protein sequence (amino acid string)
+ **Output:** Classification tag (algal/contaminant) via next-token prediction
+
+ ## Applications
+
+ algaGPT was used as the primary proteome extraction tool in the ELF-NET study (Nelson et al., forthcoming), where it purified algal protein sequences from 2,044 TARA Oceans metagenome assemblies, yielding 221.9 million sequences for downstream domain-environment coupling analysis.
+
+ ## Authors
+
+ David R. Nelson, Ashish Kumar Jaiswal, Noha Samir Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani
 
+ Green Genomics Lab, New York University Abu Dhabi
 
  ## Citation
 
+ ```bibtex
+ @article{la4sr2025,
+ title={Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras},
+ author={Nelson, David R. and Jaiswal, Ashish Kumar and Ismail, Noha Samir and Mystikou, Alexandra and Salehi-Ashtiani, Kourosh},
+ journal={Patterns},
+ volume={6},
+ pages={101373},
+ year={2025},
+ publisher={Cell Press},
+ doi={10.1016/j.patter.2025.101373}
+ }
+ ```
 
  ## Contact
 
+ Kourosh Salehi-Ashtiani — ksa3@nyu.edu
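
The Usage section of the updated README describes inference as a protein sequence going in and a classification tag coming out via next-token prediction. The following is a minimal Python sketch of that input/output contract only, not the authors' inference code. The model call is abstracted behind a hypothetical `generate` callable, and the accepted alphabet, the newline separator between sequence and tag, and the tag strings `algal` and `contaminant` are all assumptions made for illustration; the model card does not specify them.

```python
from typing import Callable

# Assumed input alphabet: the 20 standard amino acids plus X (unknown residue).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")

# Hypothetical tag strings; the real tag vocabulary is not given in the README.
TAGS = ("algal", "contaminant")


def clean_sequence(seq: str) -> str:
    """Remove whitespace, upper-case, and strip trailing stop symbols ('*')."""
    seq = "".join(seq.split()).upper().rstrip("*")
    unexpected = set(seq) - AMINO_ACIDS
    if not seq or unexpected:
        raise ValueError(
            f"invalid protein sequence (unexpected characters: {sorted(unexpected)})"
        )
    return seq


def classify(seq: str, generate: Callable[[str], str]) -> str:
    """Return the tag that the model's continuation starts with.

    `generate` stands in for the model: it receives the prompt text and
    returns the generated continuation as a string.
    """
    prompt = clean_sequence(seq) + "\n"  # assumed separator before the tag
    continuation = generate(prompt).strip().lower()
    for tag in TAGS:
        if continuation.startswith(tag):
            return tag
    raise ValueError(f"no known tag in model output: {continuation!r}")
```

With a real checkpoint, `generate` would wrap whatever sampling routine the nanoGPT-based model exposes; passing it in as a callable keeps the sketch independent of that detail.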