---
library_name: transformers
pipeline_tag: feature-extraction
model_name: InstaDeepAI/IDP-ESM2-8M
---

# IDP-ESM2-8M

**IDP-ESM2-8M** is an ESM2-style encoder for representation learning on intrinsically disordered protein (IDP) sequences, trained on [IDP-Euka-90](https://huggingface.co/datasets/InstaDeepAI/IDP-Euka-90).  
This repository provides a Transformer encoder suitable for extracting **sequence embeddings**.

---

## Quick start: generate embeddings

The snippet below loads the tokenizer and model, runs a forward pass on a couple of sequences, and extracts per-token embeddings for each sequence.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# --- Config ---
model_name = "InstaDeepAI/IDP-ESM2-8M"

# --- Load model and tokenizer ---
# The tokenizer is loaded from the base ESM2 checkpoint, whose vocabulary this model shares
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained(model_name)
model.eval()

# (optional) use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# --- Input sequences ---
sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]

# --- Tokenize ---
inputs = tokenizer(
    sequences,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# --- Forward pass ---
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_dim)
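```

The `last_hidden_state` tensor holds one embedding per token. To get a single fixed-size vector per sequence, one common option (a convention, not something this model card prescribes) is to mean-pool over positions, using the attention mask so padding tokens are excluded. A minimal sketch, continuing from the variables above:

```python
# --- Pool per-token embeddings into one vector per sequence ---
# Mean-pool over real tokens only; padding positions are masked out.
# Depending on your use case, you may also want to exclude the special
# <cls>/<eos> tokens before pooling.
mask = inputs["attention_mask"].unsqueeze(-1)   # (batch, seq_len, 1)
summed = (embeddings * mask).sum(dim=1)         # (batch, hidden_dim)
counts = mask.sum(dim=1).clamp(min=1)           # (batch, 1)
sequence_embeddings = summed / counts           # (batch, hidden_dim)

print(sequence_embeddings.shape)  # (2, hidden_dim); 320 for the standard 8M ESM2 architecture
```

Alternatives include taking the `<cls>` token embedding or max-pooling; which works best is task-dependent.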