# IDPro

IDPro (Illuminating the Dark PROteome) is a multimodal protein → text model. A frozen ESM C protein encoder feeds per-residue embeddings through a 1D-Conv adaptor and an MLP projector into a Qwen3.5-27B language model. This release is the fully trained end-to-end model used for initial exploratory analysis.
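As a rough sketch of the encoder-to-LLM path (module names, activations, and layer layout beyond the shapes listed below are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class ProteinAdaptor(nn.Module):
    """Hypothetical sketch of the adaptor -> projector path described above."""

    def __init__(self, encoder_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        # 2x 1D convolutions over the residue axis, kernel size 7 (per the table below)
        self.adaptor = nn.Sequential(
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=7, padding=3),
        )
        # 2-layer MLP projecting into the LLM embedding space (1152 -> 5120)
        self.projector = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, residue_embeds: torch.Tensor) -> torch.Tensor:
        # residue_embeds: (batch, seq_len, encoder_dim) from the frozen ESM C encoder
        x = self.adaptor(residue_embeds.transpose(1, 2)).transpose(1, 2)
        return self.projector(x)  # (batch, seq_len, llm_dim), fed to the frozen LLM
```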
## What's in this release
The release ships only the trained weights on top of the two frozen base models; the bases (ESM C 600M and Qwen3.5-27B) are not bundled.
| Component | Shape / size | Source |
|---|---|---|
| adaptor (2× 1D Conv, kernel 7) | encoder_dim=1152 | this release |
| projector (2-layer MLP) | 1152 → 5120 | this release |
| protein_modality_embed + prot_end_embed | (1, 1, 5120) each | this release |
| protein_position (sinusoidal + learned scale) | small | this release |
| evidence_head_pre (pre-LLM, on adaptor output) | 9-class | this release |
| evidence_head_post (post-LLM, layer 48 of Qwen) | 9-class | this release |
| llm.lm_head.weight (resized output projection) | (vocab × 5120) | this release |
| ESM C 600M encoder (frozen) | – | EvolutionaryScale/esmc-600m-2024-12 |
| Qwen3.5-27B base LLM (frozen) | – | Qwen/Qwen3.5-27B |
`idpro_state.pt` is a `{name: tensor}` dict with ~29 keys. It is loaded with
`strict=False` against the IDPro module tree (the encoder and base LLM keys
are not present in this dict; they come from the frozen bases).
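A minimal loading sketch, assuming a bare `IDProModel()` constructor that builds the frozen bases plus untrained IDPro modules (the constructor signature is an assumption):

```python
import torch
from idpro.model import IDProModel

model = IDProModel()  # assumption: builds frozen ESM C + Qwen bases with untrained IDPro modules
state = torch.load("idpro_state.pt", map_location="cpu")

# strict=False: the ~29 keys cover only the trained IDPro modules;
# encoder and base-LLM weights come from the frozen bases instead.
missing, unexpected = model.load_state_dict(state, strict=False)
assert not unexpected, "checkpoint keys should all map onto the IDPro module tree"
```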
## Usage
```python
import torch
from idpro.model import IDProModel

model = IDProModel.from_pretrained(
    "user/idpro-esmc-stage4",
    device="cuda",
    dtype=torch.bfloat16,
    # llm_name="<local-qwen-path>",  # optional: point at a local Qwen snapshot
)

answers = model.generate(
    sequences=["MGSSHHHHHHSSGLVPRGSHM..."],
    questions=["What is the function of this protein?"],
)
```
The Aim 1 evaluation pipeline (probe benchmarks + conformal calibration) takes
a local checkpoint directory via `--ckpt`, so you can either point it at the
release directory directly or stage one with `huggingface_hub.snapshot_download`, as sketched below.
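For example, staging a local copy (the eval script name in the comment is a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the release into the local HF cache; returns the checkpoint directory
ckpt_dir = snapshot_download(repo_id="user/idpro-esmc-stage4")
# then e.g.: python <eval_script>.py --ckpt {ckpt_dir}
print(ckpt_dir)
```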
## Special tokens
Training added four IDPro special tokens to the Qwen tokenizer:
`<PROT_START>`, `<PROT_END>`, `<RAG_START>`, `<RAG_END>`.
`from_pretrained` rebuilds the tokenizer by adding these via
`tokenizer.add_special_tokens(...)` and resizing the model's token embeddings
to match the saved `lm_head` row count.
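A sketch of the equivalent tokenizer setup with the standard `transformers` API (this approximates what `from_pretrained` does; the model attribute path in the final comment is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<PROT_START>", "<PROT_END>", "<RAG_START>", "<RAG_END>"]
})

# Embeddings are resized so len(tokenizer) matches the saved lm_head row count:
# model.llm.resize_token_embeddings(len(tokenizer))  # attribute path is an assumption
```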
## License
Weights are released under CC-BY-NC-4.0 (research use; check upstream model licenses for ESM C and Qwen3.5-27B before redistribution).