Update README.md
README.md
CHANGED
@@ -4,7 +4,60 @@ emoji: 🔥
 colorFrom: red
 colorTo: gray
 sdk: docker
-pinned:
+pinned: true
 ---
 
-
+# PSALM
+This repository contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our [2024 preprint](https://www.biorxiv.org/content/10.1101/2024.06.04.596712v1).
+
+## Abstract
+Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation with Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We validate PSALM's performance on a curated set of "ground truth" annotations determined by a profile HMM-based method and highlight PSALM as a promising alternative for protein sequence annotation.
+
+## Usage
+PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:
+
+```
+conda create -n "psalm" python=3.10
+conda activate psalm
+pip install torch protein-sequence-annotation notebook ipykernel
+python -m ipykernel install --user
+```
+
+OR just install PSALM alone by using the [`protein-sequence-annotation` PyPI package](https://pypi.org/project/protein-sequence-annotation/#description).
+```
+pip install protein-sequence-annotation
+```
+
+After the pip install, you can load and use a pretrained model as follows:
+```python
+import torch
+from psalm import psalm
+
+# Load PSALM clan and fam models
+PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1-clan",
+              fam_model_name="ProteinSequenceAnnotation/PSALM-1-family",
+              device = 'cpu') #cpu by default, replace with 'cuda' or 'mps' as needed
+
+# Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
+data = [
+    ("Human Beta Globin", "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"),
+    ("Flavohemoprotein", "MLDAQTIATVKATIPLLVETGPKLTAHFYDRMFTHNPELKEIFNMSNQRNGDQREALFNAIAAYASNIENLPALLPAVEKIAQKHTSFQIKPEQYNIVGEHLLATLDEMFSPGQEVLDAWGKAYGVLANVFINREAEIYNENASKAGGWEGTRDFRIVAKTPRSALITSFELEPVDGGAVAEYRPGQYLGVWLKPEGFPHQEIRQYSLTRKPDGKGYRIAVKREEGGQVSNWLHNHANVGDVVKLVAPAGDFFMAVADDTPVTLISAGVGQTPMLAMLDTLAKAGHTAQVNWFHAAENGDVHAFADEVKELGQSLPRFTAHTWYRQPSEADRAKGQFDSEGLMDLSKLEGAFSDPTMQFYLCGPVGFMQFTAKQLVDLGVKQENIHYECFGPHKVL")
+]
+
+# Visualize PSALM annotations (add optional save_path argument: PSALM.annotate(data,save_path="save_folder")
+PSALM.annotate(data)
+```
+
+## Cite
+If you find PSALM useful in your research, please cite the following paper:
+```bibtex
+@article {sarkarkrishnan2024psalm,
+    author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
+    title = {Protein Sequence Domain Annotation using Language Models},
+    year = {2024},
+    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596712},
+    journal = {bioRxiv}
+}
+
+```
+
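
The usage snippet in the added README passes `data` as a list of `(name, sequence)` tuples and mentions `PSALM.read_fasta(fasta_file_path)` for loading from a file. As a rough illustration of that input shape only, the sketch below parses FASTA text into such tuples using the standard library. `parse_fasta` is a hypothetical helper written for this note; it is not part of the `psalm` package and is not claimed to match how `PSALM.read_fasta` works internally.

```python
from io import StringIO
from typing import Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (name, sequence) tuples from FASTA-formatted lines.

    Hypothetical stand-in for illustration; not the psalm package's reader.
    """
    records: List[Tuple[str, str]] = []
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines may wrap; join them without separators.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Truncated example record, wrapped over two lines as FASTA files often are.
example = StringIO(
    ">Human Beta Globin\n"
    "MVHLTPEEKSAVTALWGKVNV\n"
    "DEVGGEALGRLLVVYPWTQRF\n"
)
data = parse_fasta(example)
```

The resulting list has the same shape as the hand-written `data` list in the README, so under that assumption it could be passed to `PSALM.annotate(data)` in the same way.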