Mulah committed on
Commit
8876d1e
·
verified ·
1 Parent(s): b76b88b

Add model card

  1. README.md +108 -3
README.md CHANGED
@@ -1,3 +1,108 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: pytorch
+ pipeline_tag: text-generation
+ tags:
+ - protein
+ - protein-function-prediction
+ - swiss-prot
+ - esm-c
+ - galactica
+ - q-former
+ - blip-2
+ - reliability
+ language:
+ - en
+ ---
+
+ # ProtTale
+
+ ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function
+ description, together with a binary reliability score indicating whether the
+ generated description is likely to be correct.
+
+ - 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
+ - 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)
+
+ ## Model description
+
+ ProtTale is a three-stage framework:
+
+ 1. **Stage 1: Protein/text alignment.** An ESM-C 300M protein encoder is
+    aligned to a Q-Former with three alignment losses, producing a fixed-length
+    sequence of protein query tokens.
+ 2. **Stage 2: Function-text generation.** The Q-Former tokens are fed as a
+    prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style
+    function descriptions.
+ 3. **Stage 3: Reliability training.** A lightweight binary classification head is
+    trained on validation/test predictions of the Stage 2 model. It predicts
+    whether the generated description is reliable.
+
+ Components:
+
+ | Component | Backbone |
+ | --- | --- |
+ | Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
+ | Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
+ | Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
+ | Reliability head | MLP over Q-Former + LLM hidden states |
+
+ ## Files
+
+ | File | Description |
+ | --- | --- |
+ | `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |
+
+ ## Usage
+
+ Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).
+
+ To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale)
+ and follow the setup instructions. Download the checkpoint:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
+ ```
+
+ Then run inference:
+
+ ```bash
+ python predict_single.py \
+   --ckpt <path-to-checkpoint.ckpt> \
+   --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
+ ```
+
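The inference command above can also be assembled from Python. The following is a minimal sketch and not part of the committed README. It assumes `predict_single.py` sits in the working directory of a cloned repository and uses only the `--ckpt` and `--seq` flags shown above. The helper name `build_predict_command` is hypothetical, and so is the whitespace stripping and upper-casing of the input.

```python
import sys
from typing import List


def build_predict_command(ckpt_path: str, sequence: str) -> List[str]:
    """Return the argv list for the predict_single.py call shown above.

    Hypothetical helper: only the --ckpt and --seq flags documented in the
    README are used; nothing else about the script's interface is assumed.
    """
    # Assumption: remove whitespace and upper-case so the sequence is one token.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    return [sys.executable, "predict_single.py", "--ckpt", ckpt_path, "--seq", cleaned]


if __name__ == "__main__":
    cmd = build_predict_command("checkpoint.ckpt", "MKTVRQERLKSIVRILERS")
    # Print instead of executing, since running requires the cloned repository.
    print(" ".join(cmd))
```

From the repository root, the returned list could be passed to `subprocess.run(cmd, check=True)`; building the command as a list avoids shell quoting issues.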
+ Each output is a JSON line:
+
+ ```json
+ {
+   "sequence": "MKTVRQER...",
+   "prediction": "Catalyzes ...",
+   "reliability": 1.0,
+   "reliability_pos_prob": 0.9123
+ }
+ ```
+
+ - `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
+ - `reliability_pos_prob` ∈ [0, 1]: model probability for the reliable class
+
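The two reliability fields above lend themselves to downstream filtering. The following is a minimal sketch and not part of the committed README. It assumes each record arrives as one JSON object per line with exactly the keys shown above. The function name `filter_reliable`, the `min_prob` threshold, and the sample records are illustrative assumptions, not values taken from the repository.

```python
import json
from typing import Dict, Iterable, List


def filter_reliable(lines: Iterable[str], min_prob: float = 0.5) -> List[Dict]:
    """Keep records flagged reliable whose probability is at least min_prob."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record["reliability"] == 1.0 and record["reliability_pos_prob"] >= min_prob:
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up sample records that only mirror the documented field names.
    sample = [
        '{"sequence": "MKTV", "prediction": "Catalyzes ...", '
        '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
        '{"sequence": "MAAA", "prediction": "Binds ...", '
        '"reliability": 0.0, "reliability_pos_prob": 0.2}',
    ]
    for rec in filter_reliable(sample, min_prob=0.8):
        print(rec["sequence"], rec["reliability_pos_prob"])
```

Raising `min_prob` trades coverage for stricter filtering; a suitable value would need to be chosen on held-out data.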
+ ## Training data
+
+ Swiss-Prot function descriptions, split into train / validation / test /
+ unseen sets (the `SwissProtV3` splits in the GitHub repository).
+
+ ## Intended use & limitations
+
+ - **Intended use:** research-oriented annotation of protein function from
+   sequence; assistance in literature triage; downstream filtering of
+   predictions via the reliability score.
+ - **Out of scope:** clinical decision-making, safety-critical applications,
+   inference on sequences longer than 1024 residues (the model truncates).
+ - The reliability score is a model-internal confidence estimate, not a
+   guarantee of correctness.
+
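Because sequences longer than 1024 residues are truncated, it can help to flag them before inference. The following is a minimal sketch and not part of the committed README. The 1024 limit comes from the text above; the helper name `residues_over_limit` is hypothetical, and which residues the model keeps when it truncates is not stated, so the sketch only counts the excess.

```python
MAX_RESIDUES = 1024  # truncation limit stated in the README


def residues_over_limit(sequence: str, max_len: int = MAX_RESIDUES) -> int:
    """Return how many residues exceed max_len (0 if the sequence fits)."""
    # Ignore whitespace and line breaks, e.g. from a pasted FASTA body.
    cleaned = "".join(sequence.split())
    return max(0, len(cleaned) - max_len)


if __name__ == "__main__":
    seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
    extra = residues_over_limit(seq)
    if extra:
        print(f"warning: {extra} residues beyond the limit would be truncated")
    else:
        print("sequence fits within the limit")
```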
+ ## License
+
+ Apache 2.0.