+---
+license: apache-2.0
+library_name: pytorch
+pipeline_tag: text-generation
+tags:
+  - protein
+  - protein-function-prediction
+  - swiss-prot
+  - esm-c
+  - galactica
+  - q-former
+  - blip-2
+  - reliability
+language:
+  - en
+---
+# ProtTale
+ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function
+description, together with a binary reliability score indicating whether the
+generated description is likely to be correct.
+- 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
+- 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)
+## Model description
+ProtTale is a three-stage framework:
+1. **Stage 1 — Protein/text alignment.** An ESM-C 300M protein encoder is
+   aligned to a Q-Former with three alignment losses, producing a fixed-length
+   sequence of protein query tokens.
+2. **Stage 2 — Function-text generation.** The Q-Former tokens are fed as a
+   prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style
+   function descriptions.
+3. **Reliability training.** A lightweight binary classification head is
+   trained on validation/test predictions of the Stage 2 model. It predicts
+   whether the generated description is reliable.
+Components:
+| Component | Backbone |
+| --- | --- |
+| Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
+| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
+| Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
+| Reliability head | MLP over Q-Former + LLM hidden states |
+## Files
+| File | Description |
+| --- | --- |
+| `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |
+## Usage
+Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).
+To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale)
+and follow its setup instructions. Then download the checkpoint from the Hugging Face Hub:
+```python
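+from huggingface_hub import hf_hub_download
+# Downloads into the local Hugging Face cache and returns the local file path.
+ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
+print(ckpt)  # pass this path to --ckpt in the command below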
+```
+Then run inference:
+```bash
+python predict_single.py \
+  --ckpt <path-to-checkpoint.ckpt> \
+  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
+```
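To script the command above for several sequences, the argument list can be assembled in Python. This is a minimal sketch, not part of the repository: `build_command` and `VALID_RESIDUES` are hypothetical names, only the `--ckpt` and `--seq` flags come from the command above, and restricting input to the 20 standard amino-acid letters is an assumption (the model may accept other symbols).

```python
import sys

# Assumption: accept only the 20 standard amino-acid letters.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def build_command(ckpt_path, sequence):
    """Return the argv list for one predict_single.py call (hypothetical helper)."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = sorted(set(seq) - VALID_RESIDUES)
    if unknown:
        raise ValueError(f"unexpected residue letters: {unknown}")
    # Only the --ckpt and --seq flags are taken from the model card.
    return [sys.executable, "predict_single.py", "--ckpt", ckpt_path, "--seq", seq]


cmd = build_command("checkpoint.ckpt", "mktvrqerlk")
print(cmd[1:])
```

Each resulting list can be passed to `subprocess.run(cmd, check=True)` from the root of the cloned repository.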
+Each output is a single JSON line (pretty-printed here for readability):
+```json
+{
+  "sequence": "MKTVRQER...",
+  "prediction": "Catalyzes ...",
+  "reliability": 1.0,
+  "reliability_pos_prob": 0.9123
+}
+```
+- `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
+- `reliability_pos_prob` ∈ [0, 1] — model probability for the reliable class
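The two reliability fields above lend themselves to downstream filtering. The following is a minimal sketch assuming one JSON object per output line: `parse_predictions`, `keep_reliable`, and the `min_prob` threshold are hypothetical, and the sample records are illustrative placeholders rather than real model output.

```python
import json


def parse_predictions(lines):
    """Parse JSON-lines output into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def keep_reliable(records, min_prob=0.5):
    """Keep records flagged reliable whose probability is at least min_prob."""
    return [
        r for r in records
        if r["reliability"] == 1.0 and r["reliability_pos_prob"] >= min_prob
    ]


# Illustrative records only; not real model output.
sample = [
    '{"sequence": "MKT", "prediction": "Example A", "reliability": 1.0, "reliability_pos_prob": 0.91}',
    '{"sequence": "MAA", "prediction": "Example B", "reliability": 0.0, "reliability_pos_prob": 0.12}',
]
records = parse_predictions(sample)
print([r["sequence"] for r in keep_reliable(records, min_prob=0.8)])
```

Raising `min_prob` trades coverage for stricter filtering; a suitable value depends on the application and is not specified by the model card.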
+## Training data
+Swiss-Prot function descriptions, split into train / validation / test /
+unseen sets (the `SwissProtV3` splits in the GitHub repository).
+## Intended use & limitations
+- **Intended use:** research-oriented annotation of protein function from
+  sequence; assistance in literature triage; downstream filtering of
+  predictions via the reliability score.
+- **Out of scope:** clinical decision-making, safety-critical applications,
+  inference on sequences longer than 1024 residues (the model truncates longer inputs).
+- The reliability score is a model-internal confidence estimate, not a
+  guarantee of correctness.
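Because inputs beyond 1024 residues are truncated, it can help to check sequence length before inference. This is a minimal sketch: `split_at_limit` is a hypothetical helper, and it assumes truncation keeps the N-terminal prefix, which the model card does not state.

```python
MAX_RESIDUES = 1024  # limit stated in the model card


def split_at_limit(sequence, limit=MAX_RESIDUES):
    """Return (kept, dropped): the first `limit` residues and the remainder.

    Assumption: truncation keeps the N-terminal prefix; the model card does
    not say which end is cut.
    """
    return sequence[:limit], sequence[limit:]


kept, dropped = split_at_limit("M" * 1030)
if dropped:
    print(f"{len(dropped)} residues fall beyond the {MAX_RESIDUES}-residue limit")
```

A non-empty `dropped` value signals that the prediction would only reflect part of the protein.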
+## License
+Apache 2.0.