---
license: apache-2.0
library_name: pytorch
pipeline_tag: text-generation
tags:
- protein
- protein-function-prediction
- swiss-prot
- esm-c
- galactica
- q-former
- blip-2
- reliability
language:
- en
---

# ProtTale

ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function description, together with a binary reliability score indicating whether the generated description is likely to be correct.

- 🧪 **Demo (Hugging Face Space):** [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
- 💻 **Source code:** [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)

## Model description

ProtTale is a three-stage framework:

1. **Stage 1: Protein/text alignment.** An ESM-C 300M protein encoder is aligned to a Q-Former with three alignment losses, producing a fixed-length sequence of protein query tokens.
2. **Stage 2: Function-text generation.** The Q-Former tokens are fed as a prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style function descriptions.
3. **Reliability training.** A lightweight binary classification head is trained on validation/test predictions of the Stage 2 model. It predicts whether the generated description is reliable.

Components:

| Component | Backbone |
| --- | --- |
| Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
| Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
| Reliability head | MLP over Q-Former + LLM hidden states |

## Files

| File | Description |
| --- | --- |
| `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |

## Usage

Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).

To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale) and follow the setup instructions.

Download the checkpoint:

```python
from huggingface_hub import hf_hub_download

# Returns the local path of the cached checkpoint file.
ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
```

Then run inference, passing the downloaded checkpoint path to `--ckpt`:

```bash
python predict_single.py \
  --ckpt /path/to/checkpoint.ckpt \
  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
```

Each output is a JSON line:

```json
{
  "sequence": "MKTVRQER...",
  "prediction": "Catalyzes ...",
  "reliability": 1.0,
  "reliability_pos_prob": 0.9123
}
```

- `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
- `reliability_pos_prob` ∈ [0, 1]: model probability for the reliable class

## Training data

Swiss-Prot function descriptions, split into train / validation / test / unseen sets (the `SwissProtV3` splits in the GitHub repository).

## Intended use & limitations

- **Intended use:** research-oriented annotation of protein function from sequence; assistance in literature triage; downstream filtering of predictions via the reliability score.
- **Out of scope:** clinical decision-making, safety-critical applications, and inference on sequences longer than 1024 residues (the model truncates longer inputs).
- The reliability score is a model-internal confidence estimate, not a guarantee of correctness.

## License

Apache 2.0.
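
## Example: filtering predictions by reliability

The intended-use section mentions downstream filtering via the reliability score. The following is a minimal sketch of how that might look, assuming the output of `predict_single.py` has been collected as one JSON object per line with the fields shown under Usage. The helper names (`parse_predictions`, `keep_reliable`), the `min_prob` threshold, and the second example record are hypothetical and are not part of the ProtTale repository.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_predictions(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_reliable(records: Iterable[Dict], min_prob: float = 0.5) -> List[Dict]:
    """Keep records whose reliable-class probability is at least min_prob.

    min_prob is an illustrative threshold, not a value recommended by the authors.
    """
    return [r for r in records if r["reliability_pos_prob"] >= min_prob]


# Illustrative input: the first record mirrors the example under Usage,
# the second is invented purely to show a filtered-out case.
example_lines = [
    '{"sequence": "MKTVRQER...", "prediction": "Catalyzes ...", '
    '"reliability": 1.0, "reliability_pos_prob": 0.9123}',
    '{"sequence": "MAAA...", "prediction": "Binds ...", '
    '"reliability": 0.0, "reliability_pos_prob": 0.2}',
]

kept = keep_reliable(parse_predictions(example_lines), min_prob=0.8)
for record in kept:
    print(record["sequence"], record["reliability_pos_prob"])
```

Thresholding on `reliability_pos_prob` instead of the binary `reliability` field lets the strictness of the filter be tuned per application. Because the score is a model-internal confidence estimate, any threshold should be checked against held-out annotations before being relied on.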