ProtTale
ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function description, together with a binary reliability score indicating whether the generated description is likely to be correct.
- 🧪 Demo (Hugging Face Space): Mulah/ProtTale-demo
- 💻 Source code: github.com/mulahteele/ProtTale
Model description
ProtTale is a three-stage framework:
- Stage 1: Protein/text alignment. An ESM-C 300M protein encoder is aligned to a Q-Former with three alignment losses, producing a fixed-length sequence of protein query tokens.
- Stage 2: Function-text generation. The Q-Former tokens are fed as a prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style function descriptions.
- Stage 3: Reliability training. A lightweight binary classification head is trained on validation/test predictions of the Stage 2 model. It predicts whether the generated description is reliable.
Components:
| Component | Backbone |
|---|---|
| Protein encoder | esmc_300m (ESM-C 300M, LoRA-tuned) |
| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
| Generation LLM | facebook/galactica-1.3b (LoRA-tuned) |
| Reliability head | MLP over Q-Former + LLM hidden states |
Files
| File | Description |
|---|---|
| checkpoint.ckpt | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by predict_single.py and by the demo Space. |
Usage
Try it without any setup at the demo Space.
To run locally, clone the GitHub repository and follow its setup instructions, then download the checkpoint:
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
Then run inference:
python predict_single.py \
--ckpt <path-to-checkpoint.ckpt> \
--seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
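When annotating several sequences from Python, the command above can be assembled programmatically. The following is a minimal sketch, assuming predict_single.py sits in the current working directory and takes the --ckpt and --seq flags shown above. The helper names `build_command` and `run_prediction` are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Assemble the argv list for one predict_single.py call.

    Only the --ckpt and --seq flags from the usage example are used.
    """
    # Remove spaces and line breaks that often come along with pasted sequences.
    seq = "".join(sequence.split())
    if not seq:
        raise ValueError("sequence is empty")
    return [sys.executable, script, "--ckpt", ckpt_path, "--seq", seq]


def run_prediction(ckpt_path, sequence):
    """Run the script once and return its raw standard output.

    Assumption: the script writes its result to stdout.
    """
    result = subprocess.run(
        build_command(ckpt_path, sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems. Each call starts a new process, so the checkpoint would presumably be reloaded per sequence.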
Each output is a JSON line:
{
"sequence": "MKTVRQER...",
"prediction": "Catalyzes ...",
"reliability": 1.0,
"reliability_pos_prob": 0.9123
}
- reliability ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
- reliability_pos_prob ∈ [0, 1]: model probability for the reliable class
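The output fields can be used to filter predictions downstream. The sketch below assumes each record arrives as a single JSON object on its own line and keeps records whose reliability_pos_prob meets a cutoff. The function name `parse_predictions` and the `min_prob` parameter are hypothetical, and the default of 0.5 is an arbitrary choice, not necessarily the threshold the model uses to set reliability.

```python
import json


def parse_predictions(lines, min_prob=0.5):
    """Parse JSON-lines output and keep records at or above a probability cutoff."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record["reliability_pos_prob"] >= min_prob:
            kept.append(record)
    return kept


# Illustrative input: the first record mirrors the example above,
# the second is a made-up record used only to show the filter.
sample = [
    '{"sequence": "MKTVRQER", "prediction": "Catalyzes ...", '
    '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
    '{"sequence": "MAAA", "prediction": "Binds ...", '
    '"reliability": 0.0, "reliability_pos_prob": 0.2}',
]
reliable = parse_predictions(sample, min_prob=0.8)
```

A stricter cutoff trades coverage for precision; a suitable value depends on the downstream task.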
Training data
Swiss-Prot function descriptions, split into train / validation / test /
unseen sets (the SwissProtV3 splits in the GitHub repository).
Intended use & limitations
- Intended use: research-oriented annotation of protein function from sequence; assistance in literature triage; downstream filtering of predictions via the reliability score.
- Out of scope: clinical decision-making, safety-critical applications, and inference on sequences longer than 1024 residues (the model truncates longer inputs).
- The reliability score is a model-internal confidence estimate, not a guarantee of correctness.
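Because inputs beyond 1024 residues are truncated, it can help to flag long sequences before inference. The sketch below only counts residues over the limit; it does not reproduce the model's truncation, since how the truncation is applied is not stated here. The names `MAX_RESIDUES`, `residues_over_limit`, and `split_by_length` are hypothetical.

```python
MAX_RESIDUES = 1024  # limit stated in the out-of-scope note above


def residues_over_limit(sequence, max_residues=MAX_RESIDUES):
    """Return how many residues exceed the limit (0 if the sequence fits)."""
    seq = "".join(sequence.split())  # ignore whitespace and line breaks
    return max(0, len(seq) - max_residues)


def split_by_length(sequences, max_residues=MAX_RESIDUES):
    """Separate sequences that fit within the limit from those that do not."""
    in_scope, too_long = [], []
    for seq in sequences:
        if residues_over_limit(seq, max_residues) == 0:
            in_scope.append(seq)
        else:
            too_long.append(seq)
    return in_scope, too_long
```

Sequences in the second list can then be reviewed or handled separately instead of being silently truncated.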
License
Apache 2.0.