ArpanSarkar committed on
Commit 61510cc · verified · 1 Parent(s): 1d280eb

Update README.md

Files changed (1)
  1. README.md +17 -7
README.md CHANGED
@@ -6,14 +6,13 @@ colorTo: gray
6
  sdk: docker
7
  pinned: true
8
  ---
9
- **NOTE:** PSALM-1 has not been trained on all Pfam families, as it has been trained and benchmarked on highly-curated datasets with strict sequence similarity guarantees between train and test data. **PSALM-1b (trained on all families in Pfam 35.0) is coming soon**
10
 
11
  # PSALM
12
- This repository contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our [2024 preprint](https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1).
13
 
14
  ## Abstract
15
- Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation with Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We validate PSALM's performance on a curated set of "ground truth" annotations determined by a profile HMM-based method and highlight PSALM as a promising alternative for protein sequence annotation.
16
-
17
  ## Usage
18
  PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:
19
 
@@ -35,8 +34,8 @@ import torch
35
  from psalm import psalm
36
 
37
  # Load PSALM clan and fam models
38
- PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1-clan",
39
- fam_model_name="ProteinSequenceAnnotation/PSALM-1-family",
40
  device = 'cpu') #cpu by default, replace with 'cuda' or 'mps' as needed
41
 
42
  # Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
@@ -47,6 +46,17 @@ data = [
47
 
48
  # Visualize PSALM annotations (add optional save_path argument: PSALM.annotate(data, save_path="save_folder"))
49
  PSALM.annotate(data)
50
  ```
51
 
52
  ## Cite
@@ -56,7 +66,7 @@ If you find PSALM useful in your research, please cite the following paper:
56
  author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
57
  title = {Protein Sequence Domain Annotation using Language Models},
58
  year = {2024},
59
- URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596712},
60
  journal = {bioRxiv}
61
  }
62
 
 
6
  sdk: docker
7
  pinned: true
8
  ---
9
+ **NOTE:** We encourage you to use the PSALM-1b-clan and PSALM-1b-family models, as they are trained on the entirety of Pfam-A Seed 35.0 and can predict across all 19,632 Pfam families and all 655 Pfam clans in Pfam 35.0.
10
 
11
  # PSALM
12
+ This repository contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our [2024 preprint](https://www.biorxiv.org/content/10.1101/2024.06.04.596712v2).
13
 
14
  ## Abstract
15
+ Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation using Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We also develop the Multi-Domain Protein Homology Benchmark (MDPH-Bench), a benchmark for protein sequence domain annotation, where training and test sequences have been rigorously split to share no similarity between any of their domains at a given threshold of sequence identity. Prior benchmarks, which split one domain family at a time, do not support methods for annotating multi-domain proteins, where training and test sequences need to have multiple domains from different families. We validate PSALM’s performance on MDPH-Bench and highlight PSALM as a promising alternative to HMMER, a state-of-the-art profile HMM-based method, for protein sequence annotation.
 
16
  ## Usage
17
  PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:
18
 
 
34
  from psalm import psalm
35
 
36
  # Load PSALM clan and fam models
37
+ PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1b-clan",
38
+ fam_model_name="ProteinSequenceAnnotation/PSALM-1b-family",
39
  device = 'cpu') #cpu by default, replace with 'cuda' or 'mps' as needed
40
 
41
  # Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
 
46
 
47
  # Visualize PSALM annotations (add optional save_path argument: PSALM.annotate(data, save_path="save_folder"))
48
  PSALM.annotate(data)
49
+
50
+ # Generate predictions without visualization
51
+ predictions = PSALM.predict(data)
52
+
53
+ # Access predictions
54
+ for seq_name, pred in predictions.items():
55
+     print(f"Sequence: {seq_name}")
56
+     print("Clan Labels:", pred['clan']['labels'])
57
+     print("Clan Probabilities:", pred['clan']['probs'])
58
+     print("Family Labels:", pred['family']['labels'])
59
+     print("Family Probabilities:", pred['family']['probs'])
60
  ```
61
 
62
  ## Cite
 
66
  author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
67
  title = {Protein Sequence Domain Annotation using Language Models},
68
  year = {2024},
69
+ URL = {https://www.biorxiv.org/content/10.1101/2024.06.04.596712v2},
70
  journal = {bioRxiv}
71
  }
72
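
The `PSALM.predict` call added in this commit returns a dictionary keyed by sequence name, with `labels` and `probs` entries under `clan` and `family`. As a rough sketch of how that output might be post-processed, the snippet below collapses a list of labels into contiguous segments. It assumes, without confirmation from the README, that `pred['family']['labels']` holds one label per residue and that unannotated residues carry a background label. The `label_segments` helper, the `"None"` background string, the mock dictionary, and the `FAM_A`/`FAM_B` labels are all hypothetical and are not part of the `psalm` package.

```python
from itertools import groupby


def label_segments(labels, background="None"):
    """Collapse a per-residue label list into (label, start, end) runs.

    Positions are 1-based and inclusive. Runs whose label equals
    `background` (an assumed placeholder for "no domain") are skipped.
    """
    segments = []
    position = 1  # 1-based index of the first residue in the current run
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label != background:
            segments.append((label, position, position + length - 1))
        position += length
    return segments


# Hypothetical stand-in for the dict returned by PSALM.predict(data);
# the real label format may differ.
mock_predictions = {
    "seq1": {
        "family": {
            "labels": ["None", "FAM_A", "FAM_A", "FAM_A", "None", "FAM_B", "FAM_B"],
        },
    },
}

for seq_name, pred in mock_predictions.items():
    print(seq_name, label_segments(pred["family"]["labels"]))
```

If the real labels turn out to be arrays or integer indices rather than strings, the same run-length grouping should still apply once the background value is adjusted.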