| --- |
| license: apache-2.0 |
| library_name: pytorch |
| pipeline_tag: text-generation |
| tags: |
| - protein |
| - protein-function-prediction |
| - swiss-prot |
| - esm-c |
| - galactica |
| - q-former |
| - blip-2 |
| - reliability |
| language: |
| - en |
| --- |
| |
| # ProtTale |
|
|
| ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function |
| description, together with a binary reliability score indicating whether the |
| generated description is likely to be correct. |
|
|
| - 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo) |
| - 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale) |
|
|
| ## Model description |
|
|
| ProtTale is a three-stage framework: |
|
|
| 1. **Stage 1: Protein/text alignment.** An ESM-C 300M protein encoder is |
| aligned to a Q-Former with three alignment losses, producing a fixed-length |
| sequence of protein query tokens. |
| 2. **Stage 2: Function-text generation.** The Q-Former tokens are fed as a |
| prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style |
| function descriptions. |
| 3. **Reliability training.** A lightweight binary classification head is |
| trained on validation/test predictions of the Stage 2 model. It predicts |
| whether the generated description is reliable. |
|
|
| Components: |
|
|
| | Component | Backbone | |
| | --- | --- | |
| | Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) | |
| | Bridging module | Q-Former (BiomedNLP-PubMedBERT base) | |
| | Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) | |
| | Reliability head | MLP over Q-Former + LLM hidden states | |
|
|
| ## Files |
|
|
| | File | Description | |
| | --- | --- | |
| | `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. | |
|
|
| ## Usage |
|
|
| Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo). |
|
|
| To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale) |
| and follow its setup instructions. Then download the checkpoint with `huggingface_hub`: |
|
|
| ```python |
| from huggingface_hub import hf_hub_download |
| |
| ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt") |
| ``` |
|
|
| Then run inference: |
|
|
| ```bash |
| python predict_single.py \ |
|   --ckpt <path-to-checkpoint.ckpt> \ |
|   --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG |
| ``` |
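
For scripted use, the same call can be wrapped from Python. The sketch below assumes it is run from the root of the cloned repository, where `predict_single.py` lives. `build_command` and `run_prediction` are hypothetical helper names, not part of the repository, and the whitespace stripping and upper-casing are an assumed convenience for pasted sequences rather than documented behaviour. Only the `--ckpt` and `--seq` flags shown above are used.

```python
import subprocess
import sys


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Build the argv list for one predict_single.py call (hypothetical helper)."""
    # Assumption: pasted sequences may contain spaces or line breaks; remove them.
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    # Use the current interpreter so the call runs in the same environment.
    return [sys.executable, script, "--ckpt", str(ckpt_path), "--seq", seq]


def run_prediction(ckpt_path, sequence):
    """Run the script once and return its raw standard output as text.

    Requires the cloned repository and a downloaded checkpoint.
    """
    result = subprocess.run(
        build_command(ckpt_path, sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return result.stdout
```

Each call to `run_prediction` starts a new Python process, so this pattern suits occasional single-sequence queries better than large batches.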
|
|
| Each result is printed as a single JSON line (expanded here for readability): |
|
|
| ```json |
| { |
| "sequence": "MKTVRQER...", |
| "prediction": "Catalyzes ...", |
| "reliability": 1.0, |
| "reliability_pos_prob": 0.9123 |
| } |
| ``` |
|
|
| - `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable) |
| - `reliability_pos_prob` ∈ [0, 1]: model probability for the reliable class |
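
The reliability fields can be used to filter predictions downstream. The following is a minimal sketch, assuming the script's output has been captured as text with one JSON object per line. `parse_predictions`, `filter_reliable`, and the `min_prob` cutoff are illustrative names and values, not part of the repository, and the sample records are made up.

```python
import json


def parse_predictions(text):
    """Parse JSON-lines text into a list of dicts, skipping non-JSON lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore blank lines and any log output
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # ignore lines that only look like JSON
    return records


def filter_reliable(records, min_prob=0.5):
    """Keep records whose reliable-class probability is at least min_prob.

    The 0.5 default is an assumption, not a documented threshold.
    """
    return [r for r in records if r["reliability_pos_prob"] >= min_prob]


# Made-up sample output using the field names described above.
sample = "\n".join([
    '{"sequence": "MKTVRQER", "prediction": "Catalyzes ...", '
    '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
    '{"sequence": "MSEQ", "prediction": "Binds ...", '
    '"reliability": 0.0, "reliability_pos_prob": 0.2}',
])
records = parse_predictions(sample)
kept = filter_reliable(records, min_prob=0.8)
print(f"kept {len(kept)} of {len(records)} predictions")
```

A stricter `min_prob` trades coverage for precision; a suitable value depends on the downstream task.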
|
|
| ## Training data |
|
|
| The model is trained on Swiss-Prot function descriptions, split into train, |
| validation, test, and unseen sets (the `SwissProtV3` splits in the GitHub repository). |
|
|
| ## Intended use & limitations |
|
|
| - **Intended use:** research-oriented annotation of protein function from |
| sequence; assistance in literature triage; downstream filtering of |
| predictions via the reliability score. |
| - **Out of scope:** clinical decision-making, safety-critical applications, |
| and inference on sequences longer than 1024 residues (the model truncates longer inputs). |
| - The reliability score is a model-internal confidence estimate, not a |
| guarantee of correctness. |
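
Because inputs longer than 1024 residues are truncated, it can help to flag them before inference. This is a small sketch, assuming the limit applies to the residue count after whitespace is removed. `check_length` is a hypothetical helper, and which residues the model keeps when truncating is not specified here.

```python
MAX_RESIDUES = 1024  # limit stated in this model card


def check_length(sequence, max_len=MAX_RESIDUES):
    """Report whether a sequence exceeds the stated residue limit."""
    seq = "".join(sequence.split())  # ignore spaces and line breaks
    n = len(seq)
    return {
        "length": n,
        "truncated": n > max_len,
        "excess_residues": max(0, n - max_len),
    }


report = check_length("MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")
if report["truncated"]:
    print(f"warning: {report['excess_residues']} residues exceed the limit")
```

Predictions for flagged sequences reflect only part of the protein and are best treated with extra caution.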
|
|
| ## License |
|
|
| Apache 2.0. |
|
|