---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
  - protein
  - protein-function-prediction
  - swiss-prot
  - esm-c
  - galactica
  - q-former
  - blip-2
  - reliability
language:
  - en
---

# ProtTale

ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function
description, together with a binary reliability score indicating whether the
generated description is likely to be correct.

- 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
- 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)

## Model description

ProtTale is a three-stage framework:

1. **Stage 1 — Protein/text alignment.** An ESM-C 300M protein encoder is
   aligned to a Q-Former with three alignment losses, producing a fixed-length
   sequence of protein query tokens.
2. **Stage 2 — Function-text generation.** The Q-Former tokens are fed as a
   prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style
   function descriptions.
3. **Reliability training.** A lightweight binary classification head is
   trained on validation/test predictions of the Stage 2 model. It predicts
   whether the generated description is reliable.

Components:

| Component | Backbone |
| --- | --- |
| Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
| Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
| Reliability head | MLP over Q-Former + LLM hidden states |

## Files

| File | Description |
| --- | --- |
| `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |

## Usage

Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).

To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale)
and follow the setup instructions. Download the checkpoint:

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
```

Then run inference:

```bash
python predict_single.py \
  --ckpt <path-to-checkpoint.ckpt> \
  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
```
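If you prefer to drive the script from Python, the call above can be wrapped with `subprocess`. This is a minimal sketch, not part of the repository: the helper names `build_command` and `predict` are hypothetical, and it assumes that `predict_single.py` is in the current working directory and prints its JSON record as the last non-empty line on standard output.

```python
import json
import subprocess
import sys


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Build the argument list for the CLI call shown above (hypothetical helper)."""
    return [sys.executable, script, "--ckpt", ckpt_path, "--seq", sequence]


def predict(ckpt_path, sequence):
    """Run the script once and return its output as a dict.

    Assumption: the JSON record is the last non-empty line on stdout.
    """
    result = subprocess.run(
        build_command(ckpt_path, sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    non_empty = [line for line in result.stdout.splitlines() if line.strip()]
    return json.loads(non_empty[-1])
```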

Each output is a single JSON line (pretty-printed here for readability):

```json
{
  "sequence": "MKTVRQER...",
  "prediction": "Catalyzes ...",
  "reliability": 1.0,
  "reliability_pos_prob": 0.9123
}
```

- `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
- `reliability_pos_prob` ∈ [0, 1] — model probability for the reliable class
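
The two reliability fields can be used to filter predictions downstream. The sketch below parses JSON lines with the fields listed above and keeps only records whose `reliability_pos_prob` meets a caller-chosen cutoff. The helper names, the `0.8` cutoff, and the example records are illustrative assumptions, not real model output or a recommended threshold.

```python
import json


def parse_predictions(lines):
    """Parse JSON-lines output into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def filter_reliable(records, min_prob=0.5):
    """Keep records whose reliable-class probability is at least min_prob."""
    return [r for r in records if r["reliability_pos_prob"] >= min_prob]


# Made-up records for illustration only (not real model output).
example_lines = [
    '{"sequence": "MKTV", "prediction": "Catalyzes ...", '
    '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
    '{"sequence": "MAAA", "prediction": "Binds ...", '
    '"reliability": 0.0, "reliability_pos_prob": 0.2}',
]

records = parse_predictions(example_lines)
kept = filter_reliable(records, min_prob=0.8)  # hypothetical cutoff
```

Filtering on the probability rather than the binary `reliability` flag lets you trade coverage for precision by moving the cutoff.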

## Training data

Swiss-Prot function descriptions, split into train / validation / test /
unseen sets (the `SwissProtV3` splits in the GitHub repository).

## Intended use & limitations

- **Intended use:** research-oriented annotation of protein function from
  sequence; assistance in literature triage; downstream filtering of
  predictions via the reliability score.
- **Out of scope:** clinical decision-making, safety-critical applications,
  inference on sequences longer than 1024 residues (longer inputs are truncated).
- The reliability score is a model-internal confidence estimate, not a
  guarantee of correctness.
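
Because inputs beyond 1024 residues are truncated, it can help to check sequence length before running inference. The following is a minimal sketch: `check_sequence` is a hypothetical helper, not part of the repository, and it only normalizes whitespace and case and flags over-length inputs.

```python
MAX_RESIDUES = 1024  # length limit stated in this model card


def check_sequence(seq, max_len=MAX_RESIDUES):
    """Return (cleaned_sequence, will_be_truncated).

    Whitespace and line breaks are removed and letters are upper-cased.
    Raises ValueError if the result is empty or contains non-letters.
    """
    cleaned = "".join(seq.split()).upper()
    if not cleaned.isalpha():
        raise ValueError("sequence must be non-empty and contain only letters")
    return cleaned, len(cleaned) > max_len
```

A caller could then skip, split, or explicitly flag any sequence for which the second value is `True`, rather than relying on silent truncation.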

## License

Apache 2.0.