---
tags:
- bioinformatics
- gene
- gene set
- model_hub_mixin
- pytorch_model_hub_mixin
---

# GSFM

Trained on millions of gene sets automatically extracted from the literature and from raw RNA-seq data, GSFM learns to recover held-out genes from gene sets. The resulting model exhibits state-of-the-art performance on gene function prediction.

**Deprecation Notice**: This repo has been replaced by <https://github.com/MaayanLab/gsfm>. Different versions of the model are now available, stored on Hugging Face; see that repository for directions.

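To make the held-out-gene idea from the introduction concrete, here is a minimal sketch of how one gene set could be split into visible and held-out genes, and how a ranked prediction could be scored against the held-out part. It illustrates the general idea only and is not the authors' training or evaluation code: `hold_out_genes`, `recall_at_k`, the 40% hold-out fraction, and the toy gene list are all assumptions made for this example.

```python
import random


def hold_out_genes(gene_set, holdout_fraction=0.2, seed=0):
    """Split one gene set into (visible, held_out) lists.

    Hypothetical helper for illustration; not part of the gsfm library.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    genes = sorted(set(gene_set))  # deduplicate and fix the order for reproducibility
    if len(genes) < 2:
        raise ValueError("need at least two distinct genes to hold one out")
    random.Random(seed).shuffle(genes)
    # Hold out at least one gene, but always leave at least one visible.
    n_held = max(1, round(len(genes) * holdout_fraction))
    n_held = min(n_held, len(genes) - 1)
    return genes[n_held:], genes[:n_held]


def recall_at_k(ranked_genes, held_out, k):
    """Fraction of held-out genes found among the top-k ranked candidates."""
    top = set(ranked_genes[:k])
    return sum(gene in top for gene in held_out) / len(held_out)


# Toy gene set (illustrative data only).
toy_set = ["ACE", "ACE2", "AGT", "REN", "AGTR1"]
visible, held_out = hold_out_genes(toy_set, holdout_fraction=0.4)
print("visible:", visible)
print("held out:", held_out)
```

In a setup like this, the visible genes would be fed to a model and the held-out genes would serve as the targets that a good ranking should place near the top.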
## Website

<https://gsfm.maayanlab.cloud/>

## Usage

```bash
# install the gsfm python library from its source on huggingface
# (GIT_LFS_SKIP_SMUDGE=1 skips downloading Git LFS files during the clone)
GIT_LFS_SKIP_SMUDGE=1 pip install git+https://huggingface.co/maayanlab/gsfm
```

```python
import torch
from gsfm import Vocab, GSFM

# load gsfm vocabulary and model weights
vocab = Vocab.from_pretrained('maayanlab/gsfm')
gsfm = GSFM.from_pretrained('maayanlab/gsfm')
gsfm.eval()

# convert gene symbols into token ids, adding a leading batch dimension
token_ids = torch.tensor(vocab(['ACE1', 'ACE2']))[None, :]

# use model to predict missing genes from the set (no gradients needed for inference)
with torch.no_grad():
    logits = torch.squeeze(gsfm(token_ids))

# ten highest-scoring (score, gene) pairs, sorted by ascending score
top_10 = sorted(zip(logits.tolist(), vocab.vocab))[-10:]
top_10

# get the gene set encoding from the model's middle layer
with torch.no_grad():
    gene_set_encoding = gsfm.encode(token_ids)
gene_set_encoding
```
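
When ranking predictions it can be convenient to list the best candidates first and to leave out the genes that were already in the input set. The sketch below shows one way to do that with the standard library only. `top_k_genes` is a hypothetical helper, not part of the gsfm library, and the symbols and scores are toy values standing in for real model output.

```python
import heapq


def top_k_genes(scores, symbols, k=10, exclude=()):
    """Return the k highest-scoring (symbol, score) pairs, best first.

    Hypothetical helper for illustration; not part of the gsfm library.
    `scores` is any sequence of floats aligned with `symbols`.
    """
    if len(scores) != len(symbols):
        raise ValueError("scores and symbols must have the same length")
    excluded = set(exclude)
    candidates = [
        (score, symbol)
        for score, symbol in zip(scores, symbols)
        if symbol not in excluded
    ]
    # nlargest compares by score first, then by symbol to break ties.
    best = heapq.nlargest(k, candidates)
    return [(symbol, score) for score, symbol in best]


# Toy values standing in for model logits and vocabulary symbols.
toy_symbols = ["ACE", "ACE2", "AGT", "REN", "AGTR1"]
toy_scores = [2.5, 3.1, 0.7, 1.9, -0.4]
ranked = top_k_genes(toy_scores, toy_symbols, k=2, exclude=["ACE2"])
print(ranked)
```

With the objects from the usage example, a call such as `top_k_genes(logits.tolist(), list(vocab.vocab), exclude=['ACE1', 'ACE2'])` should work, assuming `vocab.vocab` iterates over gene symbols in the same order as the logits, as the `zip` in that example implies.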