| --- |
| library_name: transformers |
| pipeline_tag: feature-extraction |
| model_name: InstaDeepAI/IDP-ESM2-8M |
| --- |
| |
| # IDP-ESM2-8M |
|
|
| **IDP-ESM2-8M** is an ESM2-style encoder for intrinsically disordered protein (IDP) sequence representation learning, trained on [IDP-Euka-90](https://huggingface.co/datasets/InstaDeepAI/IDP-Euka-90). |
| This repository provides a Transformer encoder suitable for extracting **sequence embeddings**. |
|
|
| --- |
|
|
| ## Quick start: generate embeddings |
|
|
| The snippet below loads the tokenizer and model, runs a forward pass on two example sequences, and extracts embeddings for each sequence. |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModel |
| import torch |
| |
| # --- Config --- |
| model_name = "InstaDeepAI/IDP-ESM2-8M" |
| |
| # --- Load model and tokenizer --- |
| # The tokenizer is loaded from the ESM2 8M checkpoint; the model weights come from model_name. |
| tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D") |
| model = AutoModel.from_pretrained(model_name) |
| model.eval() |
| |
| # (optional) use GPU if available |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| model.to(device) |
| |
| # --- Input sequences --- |
| sequences = [ |
|     "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ", |
|     "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR", |
| ] |
| |
| # --- Tokenize --- |
| inputs = tokenizer( |
|     sequences, |
|     return_tensors="pt", |
|     padding=True, |
|     truncation=True, |
| ) |
| inputs = {k: v.to(device) for k, v in inputs.items()} |
| |
| # --- Forward pass --- |
| with torch.no_grad(): |
|     outputs = model(**inputs) |
| embeddings = outputs.last_hidden_state # shape: (batch, seq_len, hidden_dim) |
| |