---
tags:
- bioinformatics
- gene
- gene set
- model_hub_mixin
- pytorch_model_hub_mixin
---

# GSFM

Trained on millions of gene sets automatically extracted from literature and raw RNA-seq data, GSFM learns to recover held-out genes from gene sets. The resulting model exhibits state-of-the-art performance on gene function prediction.

**Deprecation Notice**: This repo has been replaced by <https://github.com/MaayanLab/gsfm>. Different versions of the model are now stored on Hugging Face; see that repository for directions on accessing them.

## Website

<https://gsfm.maayanlab.cloud/>

## Usage

```bash
# install gsfm python library from its source on huggingface
GIT_LFS_SKIP_SMUDGE=1 pip install git+https://huggingface.co/maayanlab/gsfm
```

```python
import torch
from gsfm import Vocab, GSFM

# load gsfm vocabulary and model weights
vocab = Vocab.from_pretrained('maayanlab/gsfm')
gsfm = GSFM.from_pretrained('maayanlab/gsfm')
gsfm.eval()

# convert gene symbols into token ids
token_ids = torch.tensor(vocab(['ACE1', 'ACE2']))[None, :]

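# use model to predict missing genes from the set
# (no_grad skips gradient tracking, which is not needed for inference)
with torch.no_grad():
    logits = torch.squeeze(gsfm(token_ids))

# pair each score with its gene symbol and keep the 10 highest-scoring genes
top_10 = sorted(zip(logits.tolist(), vocab.vocab))[-10:]
top_10

# get the gene set encoding from the model's middle layer
with torch.no_grad():
    gene_set_encoding = gsfm.encode(token_ids)
gene_set_encoding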
```
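
When ranking predictions, it is often useful to drop the genes that were already in the input set, since the goal is to recover genes that are missing from it. The following is a minimal sketch of that post-processing step in plain Python. It assumes, as the `zip` in the usage example implies, that scores and gene symbols are aligned index by index. The helper name `top_k_genes` and the toy scores are hypothetical and are not part of the `gsfm` library.

```python
import heapq


def top_k_genes(scores, symbols, query, k=10):
    """Return the k highest-scoring (symbol, score) pairs not in the query set.

    Assumes scores[i] is the model score for symbols[i].
    """
    query = set(query)
    # keep only candidate genes that were not part of the input gene set
    candidates = ((s, g) for s, g in zip(scores, symbols) if g not in query)
    # nlargest avoids sorting the full vocabulary when k is small
    best = heapq.nlargest(k, candidates)
    return [(g, s) for s, g in best]


# toy stand-ins for model output; these values are illustrative only
symbols = ['ACE', 'ACE2', 'AGT', 'REN', 'TMPRSS2']
scores = [2.1, 3.4, 1.7, 0.9, 2.8]
print(top_k_genes(scores, symbols, query=['ACE2'], k=3))
```

With the real model, the same helper could be called with `logits.tolist()` as `scores` and `vocab.vocab` as `symbols`, provided the two are aligned as assumed above.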