---
license: apache-2.0
language:
- en
tags:
- protein
- gene-ontology
- function-prediction
- biology
- bioinformatics
pipeline_tag: text-generation
datasets:
- wanglab/cafa5
---

# GO-GPT: Gene Ontology Prediction from Protein Sequences

GO-GPT is a decoder-only transformer model for predicting Gene Ontology (GO) terms from protein sequences. It combines ESM2 protein language model embeddings with an autoregressive decoder to generate GO term annotations across all three ontology aspects: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).

## Quick Start

1. Clone the repository:
```bash
git clone https://github.com/YOUR_ORG/gogpt
cd gogpt
```

2. Run the inference notebook or use Python directly:
```python
import sys

# Make the cloned repository's `src` directory importable
sys.path.insert(0, "src")

from gogpt import GOGPTPredictor

# Load from Hugging Face (downloads ~4 GB on first run)
predictor = GOGPTPredictor.from_pretrained("armansa1/gogpt-dev")

# Predict GO terms for a single protein sequence
predictions = predictor.predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH",
    organism="Homo sapiens",
)

print(predictions)
# {'MF': ['GO:0003674', 'GO:0005488', ...],
#  'BP': ['GO:0008150', 'GO:0008152', ...],
#  'CC': ['GO:0005575', 'GO:0110165', ...]}
```
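
The example output above is a dictionary keyed by aspect (`MF`, `BP`, `CC`), each holding a list of GO IDs. A minimal sketch for turning that structure into flat rows, for example before writing a TSV file, might look like the following. `flatten_predictions` is a hypothetical helper and not part of `gogpt`; the sample dictionary reuses only the IDs visible in the comment above.

```python
from typing import Dict, List, Tuple

ASPECT_NAMES = {
    "MF": "Molecular Function",
    "BP": "Biological Process",
    "CC": "Cellular Component",
}


def flatten_predictions(predictions: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Hypothetical helper: turn {aspect: [GO IDs]} into (aspect, GO ID) rows."""
    rows = []
    for aspect in ASPECT_NAMES:  # fixed MF, BP, CC order
        for go_id in predictions.get(aspect, []):
            rows.append((aspect, go_id))
    return rows


# Sample shaped like the output shown above (assumed, truncated to the visible IDs)
sample = {
    "MF": ["GO:0003674", "GO:0005488"],
    "BP": ["GO:0008150", "GO:0008152"],
    "CC": ["GO:0005575", "GO:0110165"],
}

for aspect, go_id in flatten_predictions(sample):
    print(f"{ASPECT_NAMES[aspect]}\t{go_id}")
```

Aspects missing from the input are skipped, so the helper also works on partial results.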

## Model Architecture

| Component | Description |
|-----------|-------------|
| Protein Encoder | ESM2-3B (`facebook/esm2_t36_3B_UR50D`) |
| Decoder | 12-layer GPT with prefix causal attention |
| Embedding Dim | 900 |
| Attention Heads | 12 |
| Total Parameters | ~3.2B (3B ESM2 + 200M decoder) |
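
The table does not spell out what "prefix causal attention" means. One common reading is the prefix-LM mask: positions in the prefix attend to the whole prefix in both directions, while later positions attend to the prefix plus earlier generated tokens. The sketch below builds such a mask in plain Python. Whether GO-GPT uses exactly this layout, and which tokens make up the prefix, are assumptions here; the sizes are illustrative only.

```python
from typing import List


def prefix_causal_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    return [
        # Every position sees the full prefix; beyond it, attention is causal.
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


# Hypothetical sizes: 2 prefix positions followed by 3 generated tokens
for row in prefix_causal_mask(prefix_len=2, total_len=5):
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the rows for the prefix positions are identical because each sees the entire prefix, and each later row allows one more position than the row before it.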

## Supported Organisms

GO-GPT supports organism-conditioned prediction for 200 organisms plus an `<UNKNOWN>` category (201 total). See `organism_list.txt` for the full list.

Common organisms include:
- Homo sapiens
- Mus musculus
- Escherichia coli (various strains)
- Saccharomyces cerevisiae
- Arabidopsis thaliana
- Drosophila melanogaster

For organisms not in the training set, predictions will use the `<UNKNOWN>` embedding.
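
The fallback described above can be sketched as a dictionary lookup with a default. This assumes the organism mapping is a flat name-to-ID dictionary that contains an `<UNKNOWN>` key; the real layout of `organism_mapper.json` may differ, and `resolve_organism_id` and the IDs below are made up for illustration.

```python
from typing import Dict

UNKNOWN_TOKEN = "<UNKNOWN>"


def resolve_organism_id(name: str, organism_to_id: Dict[str, int]) -> int:
    """Return the ID for `name`, or the `<UNKNOWN>` ID when it is not listed."""
    return organism_to_id.get(name.strip(), organism_to_id[UNKNOWN_TOKEN])


# Toy mapping with made-up IDs, standing in for the real mapping file
toy_mapper = {"Homo sapiens": 0, "Mus musculus": 1, UNKNOWN_TOKEN: 2}

print(resolve_organism_id("Homo sapiens", toy_mapper))  # listed in the toy mapping
print(resolve_organism_id("Danio rerio", toy_mapper))   # absent from the toy mapping
```

Matching here is exact apart from surrounding whitespace, so it is safest to copy organism names verbatim from `organism_list.txt`.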

## Files in This Repository

| File | Description |
|------|-------------|
| `model.ckpt` | Model weights (PyTorch Lightning checkpoint) |
| `config.yaml` | Model architecture configuration |
| `tokenizer_info.json` | Token vocabulary metadata |
| `go_tokenizer.json` | GO term to token ID mapping |
| `organism_mapper.json` | Organism name to ID mapping |
| `organism_list.txt` | Human-readable list of 201 supported organisms |
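
If you download the files manually, a quick completeness check against the table above can catch a partial download before loading the checkpoint. This is a minimal sketch: `missing_files` is a hypothetical helper and `gogpt_download` is a placeholder directory name, neither of which comes from the repository.

```python
from pathlib import Path
from typing import List

# File names taken from the table above
EXPECTED_FILES = [
    "model.ckpt",
    "config.yaml",
    "tokenizer_info.json",
    "go_tokenizer.json",
    "organism_mapper.json",
    "organism_list.txt",
]


def missing_files(directory: str) -> List[str]:
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# "gogpt_download" is a placeholder; substitute your own download location
print(missing_files("gogpt_download"))
```

An empty list means every file from the table is present; a nonexistent directory simply reports all six names as missing.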