---
library_name: transformers
pipeline_tag: feature-extraction
model_name: InstaDeepAI/IDP-ESM2-8M
---
# IDP-ESM2-8M
# IDP-ESM2-8M
**IDP-ESM2-8M** is an ESM2-style encoder for intrinsically disordered protein sequence representation learning, trained on [IDP-Euka-90](https://huggingface.co/datasets/InstaDeepAI/IDP-Euka-90).
This repository provides a Transformer encoder suitable for extracting **sequence embeddings**.
---
## Quick start: generate embeddings
The snippet below loads the tokenizer and model, runs a forward pass on a couple of sequences, and extracts per-residue embeddings; a pooling sketch for per-sequence embeddings follows the snippet.
```python
from transformers import AutoTokenizer, AutoModel
import torch
# --- Config ---
model_name = "InstaDeepAI/IDP-ESM2-8M"
# --- Load model and tokenizer ---
# The model reuses the standard ESM2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained(model_name)
model.eval()
# (optional) use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# --- Input sequences ---
sequences = [
"MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
"PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]
# --- Tokenize ---
inputs = tokenizer(
sequences,
return_tensors="pt",
padding=True,
truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}
# --- Forward pass ---
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state # shape: (batch, seq_len, hidden_dim)
```
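`last_hidden_state` holds one embedding per token. If you need a single vector per sequence, one common approach (not prescribed by this repository) is to mean-pool the token embeddings over the non-padding positions. A minimal sketch continuing the snippet above; note that the attention mask still includes the special CLS/EOS tokens, which you may prefer to exclude:

```python
# --- Mean-pool token embeddings into per-sequence embeddings ---
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
summed = (embeddings * mask).sum(dim=1)                # (batch, hidden_dim)
counts = mask.sum(dim=1).clamp(min=1)                  # (batch, 1)
sequence_embeddings = summed / counts                  # (batch, hidden_dim)
print(sequence_embeddings.shape)
```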