---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- protein
- protein-function-prediction
- swiss-prot
- esm-c
- galactica
- q-former
- blip-2
- reliability
language:
- en
---
# ProtTale
ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function
description, together with a binary reliability score indicating whether the
generated description is likely to be correct.
- 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
- 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)
## Model description
ProtTale is a three-stage framework:
1. **Stage 1 — Protein/text alignment.** An ESM-C 300M protein encoder is
aligned to a Q-Former with three alignment losses, producing a fixed-length
sequence of protein query tokens.
2. **Stage 2 — Function-text generation.** The Q-Former tokens are fed as a
prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style
function descriptions.
3. **Reliability training.** A lightweight binary classification head is
trained on validation/test predictions of the Stage 2 model. It predicts
whether the generated description is reliable.
Components:
| Component | Backbone |
| --- | --- |
| Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
| Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
| Reliability head | MLP over Q-Former + LLM hidden states |
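The reliability head in the last table row can be pictured with a small PyTorch module. This is only an illustrative sketch, not the repository's implementation: the class name `ReliabilityHead`, the mean pooling, the concatenation of the two pooled vectors, and the layer sizes are all assumptions. The default widths (768 for a BERT-base Q-Former, 2048 for Galactica-1.3B) are the standard hidden sizes of those backbones and may differ from what the checkpoint actually uses.

```python
import torch
from torch import nn


class ReliabilityHead(nn.Module):
    """Hypothetical MLP head over pooled Q-Former and LLM hidden states."""

    def __init__(self, qformer_dim=768, llm_dim=2048, hidden_dim=256):
        super().__init__()
        # Assumption: one hidden layer; the real head may be shaped differently.
        self.mlp = nn.Sequential(
            nn.Linear(qformer_dim + llm_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, qformer_states, llm_states):
        # Inputs are (batch, tokens, dim); mean-pool over the token axis.
        q = qformer_states.mean(dim=1)
        h = llm_states.mean(dim=1)
        logit = self.mlp(torch.cat([q, h], dim=-1)).squeeze(-1)
        # Probability of the "reliable" class, one value per example.
        return torch.sigmoid(logit)


head = ReliabilityHead()
probs = head(torch.randn(2, 32, 768), torch.randn(2, 10, 2048))
print(probs.shape)  # one probability per example in the batch
```

The token counts in the usage lines (32 query tokens, 10 generated tokens) are arbitrary placeholders chosen only to exercise the shapes.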
## Files
| File | Description |
| --- | --- |
| `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |
## Usage
Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).
To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale)
and follow its setup instructions. Then download the checkpoint:
```python
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
```
Then run inference:
```bash
python predict_single.py \
  --ckpt <path-to-checkpoint.ckpt> \
  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
```
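To annotate a few sequences from a script, the command above can be wrapped in Python. The following is a minimal sketch and not part of the repository: `build_command` and `predict` are hypothetical helper names, and the sketch assumes that `predict_single.py` is run from the repository root and prints its JSON result as the last non-empty line of standard output, which this README does not state explicitly.

```python
import json
import subprocess
import sys


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Build the argument list for one predict_single.py call.

    Only the --ckpt and --seq flags shown in this README are used.
    """
    return [sys.executable, script, "--ckpt", str(ckpt_path), "--seq", sequence]


def predict(ckpt_path, sequence):
    """Run one prediction and parse its JSON output.

    Assumption: the result is the last non-empty line written to stdout.
    """
    result = subprocess.run(
        build_command(ckpt_path, sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("predict_single.py produced no output on stdout")
    return json.loads(lines[-1])
```

Each call starts a new process and therefore presumably reloads the checkpoint, so this pattern is only sensible for a handful of sequences.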
Each prediction is written as one JSON line (pretty-printed here for readability):
```json
{
  "sequence": "MKTVRQER...",
  "prediction": "Catalyzes ...",
  "reliability": 1.0,
  "reliability_pos_prob": 0.9123
}
```
- `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
- `reliability_pos_prob` ∈ [0, 1] — model probability for the reliable class
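The two reliability fields can be used to filter predictions downstream. The sketch below parses JSON lines and keeps only records flagged as reliable, optionally with a stricter probability cut-off. The helper names `parse_predictions` and `keep_reliable`, the example records, and the 0.9 threshold are illustrative assumptions, not values recommended by the authors.

```python
import json


def parse_predictions(lines):
    """Parse JSON-lines output into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def keep_reliable(records, min_prob=None):
    """Keep records flagged reliable; optionally require a minimum probability."""
    kept = []
    for rec in records:
        if rec["reliability"] != 1.0:
            continue
        if min_prob is not None and rec["reliability_pos_prob"] < min_prob:
            continue
        kept.append(rec)
    return kept


# Made-up placeholder records using the field names documented above.
example_lines = [
    '{"sequence": "SEQ_A", "prediction": "text A", "reliability": 1.0, "reliability_pos_prob": 0.91}',
    '{"sequence": "SEQ_B", "prediction": "text B", "reliability": 0.0, "reliability_pos_prob": 0.32}',
    '{"sequence": "SEQ_C", "prediction": "text C", "reliability": 1.0, "reliability_pos_prob": 0.58}',
]
records = parse_predictions(example_lines)
print([r["sequence"] for r in keep_reliable(records)])                # ['SEQ_A', 'SEQ_C']
print([r["sequence"] for r in keep_reliable(records, min_prob=0.9)])  # ['SEQ_A']
```

A stricter `min_prob` trades coverage for precision; a suitable value would have to be chosen on held-out data.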
## Training data
Swiss-Prot function descriptions, split into train / validation / test /
unseen sets (the `SwissProtV3` splits in the GitHub repository).
## Intended use & limitations
- **Intended use:** research-oriented annotation of protein function from
sequence; assistance in literature triage; downstream filtering of
predictions via the reliability score.
- **Out of scope:** clinical decision-making, safety-critical applications,
inference on sequences longer than 1024 residues (longer inputs are truncated).
- The reliability score is a model-internal confidence estimate, not a
guarantee of correctness.
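Because inputs beyond 1024 residues are truncated, it can help to flag long sequences before running inference. The helper below is a small sketch under stated assumptions: `check_length` is a hypothetical name, and the limit is applied to the raw residue count, since this README does not say whether special tokens count toward it.

```python
MAX_RESIDUES = 1024  # truncation limit stated in this README


def check_length(sequence, max_residues=MAX_RESIDUES):
    """Return (cleaned_sequence, will_truncate) for a raw sequence string.

    Whitespace and line breaks (e.g. from a FASTA body) are removed and the
    sequence is upper-cased before counting residues.
    """
    cleaned = "".join(sequence.split()).upper()
    return cleaned, len(cleaned) > max_residues


seq, too_long = check_length("mktvrqerlk\nsivrilersk")
print(len(seq), too_long)  # 20 False
```

Sequences flagged this way can be skipped or handled separately, so that a description generated from a truncated input is not mistaken for one covering the full protein.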
## License
Apache 2.0.