ProtTale

ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function description, together with a binary reliability score indicating whether the generated description is likely to be correct.

Model description

ProtTale is a three-stage framework:

  1. Stage 1: Protein/text alignment. An ESM-C 300M protein encoder is aligned to a Q-Former with three alignment losses, producing a fixed-length sequence of protein query tokens.
  2. Stage 2: Function-text generation. The Q-Former tokens are fed as a prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style function descriptions.
  3. Stage 3: Reliability training. A lightweight binary classification head is trained on validation/test predictions of the Stage 2 model. It predicts whether the generated description is reliable.

Components:

Component          Backbone
Protein encoder    esmc_300m (ESM-C 300M, LoRA-tuned)
Bridging module    Q-Former (BiomedNLP-PubMedBERT base)
Generation LLM     facebook/galactica-1.3b (LoRA-tuned)
Reliability head   MLP over Q-Former + LLM hidden states

Files

File              Description
checkpoint.ckpt   Final reliability-finetuned checkpoint (~3.7 GB). Loaded by predict_single.py and by the demo Space.

Usage

Try it without any setup at the demo Space.

To run locally, clone the GitHub repository and follow the setup instructions. Download the checkpoint:

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")

Then run inference:

python predict_single.py \
  --ckpt <path-to-checkpoint.ckpt> \
  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
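
To annotate several sequences, the command above can be wrapped in a small script. The following is a minimal sketch, not part of the repository: `build_command` and `predict_many` are hypothetical helper names, only the `--ckpt` and `--seq` flags shown above are used, and it assumes the script writes its result to standard output.

```python
import subprocess


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Return the argument list for one predict_single.py call.

    Only the two flags documented in this model card are used.
    """
    return ["python", script, "--ckpt", ckpt_path, "--seq", sequence]


def predict_many(ckpt_path, sequences):
    """Run the script once per sequence and collect the raw stdout text.

    Assumption: the script prints its result to stdout. check=True raises
    CalledProcessError if a run exits with a non-zero status.
    """
    outputs = []
    for seq in sequences:
        result = subprocess.run(
            build_command(ckpt_path, seq),
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout)
    return outputs
```

Calling the script once per sequence reloads the checkpoint each time, so this is only practical for a handful of sequences.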

Each output is a JSON line:

{
  "sequence": "MKTVRQER...",
  "prediction": "Catalyzes ...",
  "reliability": 1.0,
  "reliability_pos_prob": 0.9123
}
  • reliability ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
  • reliability_pos_prob ∈ [0, 1]: model probability for the reliable class
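
The two reliability fields can be used to filter predictions downstream. The sketch below assumes the outputs have been collected as one JSON object per line; `parse_predictions`, `keep_reliable`, and the `min_prob` cutoff are hypothetical names introduced for illustration, and the second example record is made up to show a rejected prediction.

```python
import json


def parse_predictions(lines):
    """Parse an iterable of JSON lines into dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def keep_reliable(records, min_prob=None):
    """Keep records flagged reliable (reliability == 1.0).

    If min_prob is given, additionally require reliability_pos_prob >= min_prob.
    The cutoff is a user choice, not a value recommended by the model card.
    """
    kept = []
    for rec in records:
        if rec["reliability"] != 1.0:
            continue
        if min_prob is not None and rec["reliability_pos_prob"] < min_prob:
            continue
        kept.append(rec)
    return kept


# Illustrative input only: the second record is invented for this example.
example_lines = [
    '{"sequence": "MKTVRQER", "prediction": "Catalyzes ...", '
    '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
    '{"sequence": "MSTNPKPQ", "prediction": "Binds ...", '
    '"reliability": 0.0, "reliability_pos_prob": 0.2}',
]

records = parse_predictions(example_lines)
reliable = keep_reliable(records, min_prob=0.9)
```

Using the probability rather than the binary flag alone lets stricter pipelines trade coverage for precision.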

Training data

Swiss-Prot function descriptions, split into train / validation / test / unseen sets (the SwissProtV3 splits in the GitHub repository).

Intended use & limitations

  • Intended use: research-oriented annotation of protein function from sequence; assistance in literature triage; downstream filtering of predictions via the reliability score.
  • Out of scope: clinical decision-making, safety-critical applications, inference on sequences longer than 1024 residues (the model truncates).
  • The reliability score is a model-internal confidence estimate, not a guarantee of correctness.
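
Because inputs longer than 1024 residues are truncated, it can help to flag such sequences before running inference. This is a minimal sketch: `check_sequence` is a hypothetical helper, and stripping whitespace from the input is an assumption about how sequences are typically pasted, not documented behaviour of the model.

```python
MAX_RESIDUES = 1024  # length limit stated in this model card


def check_sequence(seq, max_len=MAX_RESIDUES):
    """Return (cleaned_sequence, will_be_truncated).

    Whitespace and line breaks are removed (an assumption for convenience);
    the flag is True when the cleaned sequence exceeds max_len residues.
    """
    cleaned = "".join(seq.split())
    return cleaned, len(cleaned) > max_len


cleaned, will_be_truncated = check_sequence("MKTVRQERLK\nSIVRILERSK")
```

A sequence flagged this way still runs, but the description reflects only the portion the model actually sees.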

License

Apache 2.0.
