# IDPro

IDPro (Illuminating the Dark PROteome) is a multimodal protein → text model. A frozen ESM C protein encoder feeds per-residue embeddings through a 1D-Conv adaptor and an MLP projector into a Qwen3.5-27B language model. This release is the fully trained end-to-end model used for initial exploratory analysis.
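As a rough sketch of the encoder-to-LLM path (module names, activations, and layer layout beyond the shapes listed below are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class ProteinAdaptor(nn.Module):
    """Hypothetical sketch of the adaptor -> projector path described above."""

    def __init__(self, encoder_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        # 2x 1D convolutions over the residue axis, kernel size 7 (per the table below)
        self.adaptor = nn.Sequential(
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=7, padding=3),
        )
        # 2-layer MLP projecting into the LLM embedding space (1152 -> 5120)
        self.projector = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, residue_embeds: torch.Tensor) -> torch.Tensor:
        # residue_embeds: (batch, seq_len, encoder_dim) from the frozen ESM C encoder
        x = self.adaptor(residue_embeds.transpose(1, 2)).transpose(1, 2)
        return self.projector(x)  # (batch, seq_len, llm_dim), fed to the frozen LLM
```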
## What's in this release
The release ships only the trained weights on top of the two frozen base models; the bases (ESM C 600M and Qwen3.5-27B) are not bundled.
| Component | Shape / size | Source |
|---|---|---|
| adaptor (2× 1D Conv, kernel 7) | encoder_dim=1152 | this release |
| projector (2-layer MLP) | 1152 → 5120 | this release |
| protein_modality_embed + prot_end_embed | (1, 1, 5120) each | this release |
| protein_position (sinusoidal + learned scale) | small | this release |
| evidence_head_pre (pre-LLM, on adaptor output) | 9-class | this release |
| evidence_head_post (post-LLM, layer 48 of Qwen) | 9-class | this release |
| llm.lm_head.weight (resized output projection) | (vocab × 5120) | this release |
| ESM C 600M encoder (frozen) | – | EvolutionaryScale/esmc-600m-2024-12 |
| Qwen3.5-27B base LLM (frozen) | – | Qwen/Qwen3.5-27B |
`idpro_state.pt` is a `{name: tensor}` dict with ~29 keys. It is loaded with
`strict=False` against the IDPro module tree (the encoder and base LLM keys
are not present in this dict; they come from the frozen bases).
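A minimal loading sketch, assuming a bare `IDProModel()` constructor that builds the frozen bases plus untrained IDPro modules (the constructor signature is an assumption):

```python
import torch
from idpro.model import IDProModel

model = IDProModel()  # assumption: builds frozen ESM C + Qwen bases with untrained IDPro modules
state = torch.load("idpro_state.pt", map_location="cpu")

# strict=False: the ~29 keys cover only the trained IDPro modules;
# encoder and base-LLM weights come from the frozen bases instead.
missing, unexpected = model.load_state_dict(state, strict=False)
assert not unexpected, "checkpoint keys should all map onto the IDPro module tree"
```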
## Usage
```python
import torch
from idpro.model import IDProModel

model = IDProModel.from_pretrained(
    "user/idpro-esmc-stage4",
    device="cuda",
    dtype=torch.bfloat16,
    # llm_name="<local-qwen-path>",  # optional: point at a local Qwen snapshot
)

answers = model.generate(
    sequences=["MGSSHHHHHHSSGLVPRGSHM..."],
    questions=["What is the function of this protein?"],
)
```
The Aim 1 evaluation pipeline (probe benchmarks + conformal calibration) takes
a local checkpoint directory via `--ckpt`, so you can either point it at the
release directory directly or stage one with `huggingface_hub.snapshot_download`, as sketched below.
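For example, staging a local copy (the eval script name in the comment is a placeholder):

```python
from huggingface_hub import snapshot_download

# Download the release into the local HF cache; returns the checkpoint directory
ckpt_dir = snapshot_download(repo_id="user/idpro-esmc-stage4")
# then e.g.: python <eval_script>.py --ckpt {ckpt_dir}
print(ckpt_dir)
```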
## Special tokens
Training added four IDPro special tokens to the Qwen tokenizer:
`<PROT_START>`, `<PROT_END>`, `<RAG_START>`, `<RAG_END>`.
`from_pretrained` rebuilds the tokenizer by adding these via
`tokenizer.add_special_tokens(...)` and resizing the model's token embeddings
to match the saved `lm_head` row count.
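A sketch of the equivalent tokenizer setup with the standard `transformers` API (this approximates what `from_pretrained` does; the model attribute path in the final comment is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<PROT_START>", "<PROT_END>", "<RAG_START>", "<RAG_END>"]
})

# Embeddings are resized so len(tokenizer) matches the saved lm_head row count:
# model.llm.resize_token_embeddings(len(tokenizer))  # attribute path is an assumption
```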
## License
Weights are released under CC-BY-NC-4.0 (research use; check upstream model licenses for ESM C and Qwen3.5-27B before redistribution).