	---
	license: apache-2.0
	library_name: pytorch
	pipeline_tag: text-generation
	tags:
	- protein
	- protein-function-prediction
	- swiss-prot
	- esm-c
	- galactica
	- q-former
	- blip-2
	- reliability
	language:
	- en
	---

	# ProtTale

	ProtTale maps a protein amino-acid sequence to a Swiss-Prot-style function
	description and attaches a binary reliability label, with its underlying
	probability, indicating whether the generated description is likely to be
	correct.

	- 🧪 Demo (Hugging Face Space): [Mulah/ProtTale-demo](https://huggingface.co/spaces/Mulah/ProtTale-demo)
	- 💻 Source code: [github.com/mulahteele/ProtTale](https://github.com/mulahteele/ProtTale)

	## Model description

	ProtTale is a three-stage framework:

	1. Stage 1: Protein/text alignment. An ESM-C 300M protein encoder is
	aligned to a Q-Former with three alignment losses, producing a fixed-length
	sequence of protein query tokens.
	2. Stage 2: Function-text generation. The Q-Former tokens are fed as a
	prefix to a Galactica-1.3B LLM (LoRA-tuned) that generates Swiss-Prot-style
	function descriptions.
	3. Stage 3: Reliability training. A lightweight binary classification head
	is trained on validation/test predictions of the Stage 2 model and predicts
	whether a generated description is reliable.

	Components:

	| Component | Backbone |
	| --- | --- |
	| Protein encoder | `esmc_300m` (ESM-C 300M, LoRA-tuned) |
	| Bridging module | Q-Former (BiomedNLP-PubMedBERT base) |
	| Generation LLM | `facebook/galactica-1.3b` (LoRA-tuned) |
	| Reliability head | MLP over Q-Former + LLM hidden states |

	## Files

	| File | Description |
	| --- | --- |
	| `checkpoint.ckpt` | Final reliability-finetuned checkpoint (~3.7 GB). Loaded by `predict_single.py` and by the demo Space. |

	## Usage

	Try it without any setup at the [demo Space](https://huggingface.co/spaces/Mulah/ProtTale-demo).

	To run locally, clone the [GitHub repository](https://github.com/mulahteele/ProtTale)
	and follow the setup instructions. Download the checkpoint:

	```python
	from huggingface_hub import hf_hub_download

	ckpt = hf_hub_download(repo_id="Mulah/ProtTale", filename="checkpoint.ckpt")
	```

	Then run inference:

	```bash
	python predict_single.py \
	  --ckpt <path-to-checkpoint.ckpt> \
	  --seq MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
	```
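
For more than a handful of sequences, the command above can be wrapped in a small Python helper. The sketch below is only an illustration: `build_command` and `run_prediction` are hypothetical names, and it assumes that `predict_single.py` is in the current working directory and writes its result to standard output, neither of which is confirmed here. Only the `--ckpt` and `--seq` flags shown above are used.

```python
import subprocess
import sys


def build_command(ckpt_path, sequence, script="predict_single.py"):
    """Build the argument list for one predict_single.py call (hypothetical helper)."""
    # Remove whitespace and line breaks so a pasted sequence stays a single argument.
    cleaned = "".join(sequence.split())
    if not cleaned:
        raise ValueError("empty sequence")
    return [sys.executable, script, "--ckpt", str(ckpt_path), "--seq", cleaned]


def run_prediction(ckpt_path, sequence):
    """Run the script once and return its stdout (assumes results go to stdout)."""
    result = subprocess.run(
        build_command(ckpt_path, sequence),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues. Each call starts a new process and so presumably reloads the ~3.7 GB checkpoint, which makes this convenient rather than fast.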

	Each output is a JSON object, one per line (pretty-printed here for readability):

	```json
	{
	  "sequence": "MKTVRQER...",
	  "prediction": "Catalyzes ...",
	  "reliability": 1.0,
	  "reliability_pos_prob": 0.9123
	}
	```

	- `reliability` ∈ {0.0, 1.0} (1.0 = reliable, 0.0 = unreliable)
	- `reliability_pos_prob` ∈ [0, 1] — model probability for the reliable class
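
The two reliability fields lend themselves to simple downstream filtering. The following sketch assumes the outputs have been collected as one JSON object per line. `filter_reliable` is a hypothetical helper, the 0.8 cutoff is an arbitrary illustration rather than a recommended threshold, and the two sample records are made-up placeholders, not real model output.

```python
import json


def filter_reliable(lines, min_prob=0.5):
    """Keep records whose reliable-class probability is at least min_prob."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record["reliability_pos_prob"] >= min_prob:
            kept.append(record)
    return kept


# Placeholder records for illustration only (not real predictions).
sample = [
    '{"sequence": "MKT", "prediction": "placeholder A", "reliability": 1.0, "reliability_pos_prob": 0.91}',
    '{"sequence": "MAA", "prediction": "placeholder B", "reliability": 0.0, "reliability_pos_prob": 0.12}',
]

for record in filter_reliable(sample, min_prob=0.8):
    print(record["sequence"], record["reliability_pos_prob"])
```

Filtering on `reliability_pos_prob` instead of the binary `reliability` field lets the cutoff be tuned to the precision a given analysis needs.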

	## Training data

	Swiss-Prot function descriptions, split into train / validation / test /
	unseen sets (the `SwissProtV3` splits in the GitHub repository).

	## Intended use & limitations

	- Intended use: research-oriented annotation of protein function from
	sequence; assistance in literature triage; downstream filtering of
	predictions via the reliability score.
	- Out of scope: clinical decision-making, safety-critical applications,
	inference on sequences longer than 1024 residues (the model truncates).
	- The reliability score is a model-internal confidence estimate, not a
	guarantee of correctness.
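
Because inputs beyond 1024 residues are truncated, it can help to flag long sequences before running inference. This is a minimal sketch: `will_be_truncated` is a hypothetical helper, and it only counts residues against the limit stated above. It does not reproduce how the model itself truncates.

```python
MAX_RESIDUES = 1024  # limit stated in this model card; longer inputs are truncated


def will_be_truncated(sequence, max_residues=MAX_RESIDUES):
    """Return True if the sequence is longer than the stated residue limit."""
    # Ignore whitespace and line breaks, e.g. from a sequence pasted across lines.
    cleaned = "".join(sequence.split())
    return len(cleaned) > max_residues


print(will_be_truncated("MKTVRQERLKSIVRILERSKEPVSGAQ"))  # False
print(will_be_truncated("M" * 1025))  # True
```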

	## License

	Apache 2.0.