IDPro

IDPro (Illuminating the Dark PROteome) is a multimodal protein → text model. A frozen ESM C protein encoder feeds per-residue embeddings through a 1D-conv adaptor and an MLP projector into a Qwen3.5-27B language model. This release is the fully trained end-to-end model used for initial exploratory analysis.

What's in this release

The release ships only the trained weights on top of the two frozen base models; the bases (ESM C 600M and Qwen3.5-27B) are not bundled.

| Component | Shape / size | Source |
|---|---|---|
| adaptor (2× 1D conv, kernel 7) | encoder_dim=1152 | this release |
| projector (2-layer MLP) | 1152 → 5120 | this release |
| protein_modality_embed + prot_end_embed | (1, 1, 5120) each | this release |
| protein_position (sinusoidal + learned scale) | small | this release |
| evidence_head_pre (pre-LLM, on adaptor output) | 9-class | this release |
| evidence_head_post (post-LLM, layer 48 of Qwen) | 9-class | this release |
| llm.lm_head.weight (resized output projection) | (vocab × 5120) | this release |
| ESM C 600M encoder (frozen) | – | EvolutionaryScale/esmc-600m-2024-12 |
| Qwen3.5-27B base LLM (frozen) | – | Qwen/Qwen3.5-27B |
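The trained pieces in the table are small relative to the frozen bases. As a rough illustration of the adaptor/projector shapes, here is a minimal PyTorch sketch; the exact layer layout (activations, padding, MLP hidden width) is an assumption, only the kernel size and the 1152 → 5120 widths come from the table:

```python
import torch
import torch.nn as nn

ENC_DIM, LLM_DIM = 1152, 5120  # ESM C 600M width -> Qwen hidden size

# Hypothetical reconstruction: two 1D convs (kernel 7) over the residue axis,
# then a 2-layer MLP projecting into the LLM embedding space.
adaptor = nn.Sequential(
    nn.Conv1d(ENC_DIM, ENC_DIM, kernel_size=7, padding=3),
    nn.GELU(),
    nn.Conv1d(ENC_DIM, ENC_DIM, kernel_size=7, padding=3),
)
projector = nn.Sequential(
    nn.Linear(ENC_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

residues = torch.randn(1, 300, ENC_DIM)                 # (batch, length, encoder_dim)
x = adaptor(residues.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, length)
tokens = projector(x)                                   # (1, 300, 5120), ready for the LLM
```

With `padding=3` the kernel-7 convolutions preserve the residue length, so every residue maps to exactly one soft token.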

`idpro_state.pt` is a `{name: tensor}` dict with ~29 keys. It is loaded with `strict=False` against the IDPro module tree (the encoder and base-LLM keys are not present in this dict — they come from the frozen bases).
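`strict=False` is what makes a partial state dict loadable: matching keys are filled in and the rest are reported instead of raising. A toy illustration of that behavior (not the IDPro module tree itself):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.projector = nn.Linear(4, 4)  # "trained" part, present in the dict
        self.encoder = nn.Linear(4, 4)    # "frozen base" part, absent from the dict

model = Toy()
partial = {"projector.weight": torch.zeros(4, 4), "projector.bias": torch.zeros(4)}

# strict=False loads what matches and returns the mismatches
result = model.load_state_dict(partial, strict=False)
print(result.missing_keys)  # the "frozen base" keys that must come from elsewhere
```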

Usage

```python
import torch
from idpro.model import IDProModel

model = IDProModel.from_pretrained(
    "user/idpro-esmc-stage4",
    device="cuda",
    dtype=torch.bfloat16,
    # llm_name="<local-qwen-path>",   # optional: point at a local Qwen snapshot
)

answers = model.generate(
    sequences=["MGSSHHHHHHSSGLVPRGSHM..."],
    questions=["What is the function of this protein?"],
)
```

The Aim 1 evaluation pipeline (probe benchmarks + conformal calibration) takes a local checkpoint directory via `--ckpt`, so you can either point it at the release directory directly or stage one with `huggingface_hub.snapshot_download`.
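Either route can be wrapped in a small helper that accepts a local directory or a Hub repo id; `resolve_ckpt` below is a hypothetical sketch, not part of the pipeline:

```python
from pathlib import Path

def resolve_ckpt(ckpt: str) -> Path:
    """Hypothetical helper: return a local checkpoint dir for --ckpt,
    staging a Hub snapshot only when given a repo id instead of a path."""
    p = Path(ckpt)
    if p.is_dir():
        return p  # already a local checkpoint directory, use as-is
    # otherwise treat the string as a Hub repo id and download a snapshot
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(ckpt))
```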

Special tokens

The training added four IDPro special tokens to the Qwen tokenizer:

`<PROT_START>`, `<PROT_END>`, `<RAG_START>`, `<RAG_END>`

`from_pretrained` rebuilds the tokenizer by adding these via `tokenizer.add_special_tokens(...)` and resizing the model's token embeddings to match the saved `lm_head` row count.
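Conceptually, that resize grows the embedding matrix by four rows while keeping the pretrained rows intact. A minimal torch sketch with toy sizes (the actual resize code in the release may differ, and the init scheme for new rows is an assumption):

```python
import torch
import torch.nn as nn

old_vocab, dim, n_new = 100, 16, 4  # toy sizes; the real model uses Qwen's vocab and dim=5120

embed = nn.Embedding(old_vocab, dim)
resized = nn.Embedding(old_vocab + n_new, dim)

with torch.no_grad():
    resized.weight[:old_vocab] = embed.weight      # keep the pretrained rows
    resized.weight[old_vocab:].normal_(std=0.02)   # fresh rows for the new special tokens
```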

License

Weights are released under CC-BY-NC-4.0 (research use; check upstream model licenses for ESM C and Qwen3.5-27B before redistribution).
