---
library_name: transformers
tags:
- biology
- protein-language-model
- protein-generation
- causal-lm
- mixture-of-experts
- transformers
---

# Model Card for ProtGPT3-10B

## Model Details

### Model Description

ProtGPT3-10B is a single-sequence autoregressive protein language model for protein sequence generation. It is part of the ProtGPT3 family, an open-source suite of promptable and aligned protein language models ranging from 112M to 10B parameters. ProtGPT3 models use a causal Mixtral-style Mixture-of-Experts architecture and are trained for causal language modeling on protein sequences.

The single-sequence ProtGPT3 models can generate proteins in either the N-to-C or the C-to-N direction, selected with special directional tokens. The model is intended for unconditional or prefix-conditioned protein sequence generation and can be used as a base model for downstream protein design workflows.

- **Developed by:** Anonymous authors
- **Model type:** Autoregressive protein language model; causal decoder-only Mixture-of-Experts model
- **Language(s):** Protein sequences / amino-acid sequences
- **License:** More Information Needed
- **Finetuned from model:** Not applicable / pretrained from scratch

### Model Sources

- **Repository:** https://huggingface.co/protgpt3
- **Paper:** ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
- **Code:** https://anonymous.4open.science/r/protGPT3-2053/README.md

## Uses

### Direct Use

ProtGPT3-10B can be used for autoregressive generation of protein sequences. Users can generate sequences unconditionally or condition generation on an amino-acid prefix.

### Downstream Use

The model may be fine-tuned or incorporated into protein design workflows, including family-specific generation, protein variant generation, and computational screening pipelines.

### Out-of-Scope Use

The model should not be used as the sole basis for experimental, clinical, environmental, or safety-critical decisions. Generated proteins require downstream computational and experimental validation. The model is not guaranteed to generate functional, soluble, safe, or synthesizable proteins.

## Bias, Risks, and Limitations

ProtGPT3-10B learns from public protein sequence datasets and may reproduce biases present in those datasets. Generated sequences may be low-complexity, nonfunctional, unstable, insoluble, or biologically implausible. Protein generation models may also present dual-use risks if used irresponsibly.

### Recommendations

Users should apply appropriate computational filters, expert review, and experimental validation before using generated sequences. Users should also consider responsible-use practices for generative protein design.
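
As one illustration of a computational filter, the sketch below rejects sequences that are very short, contain non-canonical residues, have low compositional entropy, or contain long single-residue runs. The function names and all thresholds (`min_length`, `min_entropy`, `max_run`) are hypothetical choices for this example, not values used by the ProtGPT3 authors, and passing the filter says nothing about function, stability, or safety.

```python
import math
from collections import Counter

# The 20 canonical amino acids in one-letter code.
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_homopolymer_run(sequence: str) -> int:
    """Length of the longest stretch of one repeated residue."""
    longest, current, previous = 0, 0, None
    for residue in sequence:
        current = current + 1 if residue == previous else 1
        longest = max(longest, current)
        previous = residue
    return longest


def passes_basic_filters(
    sequence: str,
    min_length: int = 30,      # assumption: illustrative threshold
    min_entropy: float = 3.0,  # assumption: illustrative threshold, in bits
    max_run: int = 8,          # assumption: illustrative threshold
) -> bool:
    """Cheap sanity checks to run before more expensive validation."""
    if len(sequence) < min_length:
        return False
    if not set(sequence) <= CANONICAL_AMINO_ACIDS:
        return False
    if shannon_entropy(sequence) < min_entropy:
        return False
    return longest_homopolymer_run(sequence) <= max_run
```

Checks like these are only a first pass; structure prediction, expert review, and experimental validation remain necessary.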

## How to Get Started with the Model

Install dependencies:

```bash
pip install transformers accelerate torch
```

Load the model and tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "protgpt3/ProtGPT3-10B"  # Replace with the final checkpoint name

# Load the tokenizer for generation: prepend BOS but do not append EOS,
# so the model continues the prompt instead of seeing a finished sequence.
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
    add_bos_token=True,
    add_eos_token=False,
)

# Load the weights in bfloat16 and let accelerate place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
```

### Generate a protein sequence

```python
import torch

prompt = ""  # Optionally provide an amino-acid prefix or a model-specific direction token
inputs = tokenizer(prompt, return_tensors="pt", padding_side="left").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# The output includes the directional token "1" (N-to-C) or "2" (C-to-N).
print(sequence)
```

### Generate from an amino-acid prefix

```python
import torch

# Forward N-to-C generation with the special token "1".
# Use the special token "2" instead of "1" for reverse C-to-N generation.
prefix = "1MKT"
inputs = tokenizer(prefix, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        inputs["input_ids"],
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

sequence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(sequence)
```
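
The direction-token convention used in the prompts above can be wrapped in two small helpers. This is a minimal sketch: `build_prompt` and `to_n_to_c` are hypothetical names, and the reversal step assumes that a C-to-N sequence is written C-terminus first, which this card does not state explicitly. The helpers also assume the decoded string may contain whitespace between tokens and remove it.

```python
FORWARD_TOKEN = "1"  # N-to-C direction, as in the prefix example above
REVERSE_TOKEN = "2"  # C-to-N direction


def build_prompt(prefix: str = "", n_to_c: bool = True) -> str:
    """Prepend the direction token to an optional amino-acid prefix."""
    return (FORWARD_TOKEN if n_to_c else REVERSE_TOKEN) + prefix


def to_n_to_c(generated: str) -> str:
    """Drop the direction token and return the residues in N-to-C order.

    Assumption: reverse-direction output lists residues C-terminus first,
    so it is reversed here to recover the conventional N-to-C order.
    """
    text = "".join(generated.split())  # remove any whitespace between tokens
    if text.startswith(FORWARD_TOKEN):
        return text[1:]
    if text.startswith(REVERSE_TOKEN):
        return text[1:][::-1]
    raise ValueError("generated text does not start with a direction token")


print(build_prompt("MKT"))                # 1MKT
print(build_prompt("MAV", n_to_c=False))  # 2MAV
print(to_n_to_c("2MAVK"))                 # KVAM
```

Under the same assumption, a C-to-N prefix such as `"2MAV"` would describe the last residues of the protein read backwards; check this against the released code before relying on it.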

### Batch generation

```python
import torch

prompts = [
    "",
    "1MKT",  # N-to-C generation
    "2MAV",  # C-to-N generation
]

# Left-pad so that every prompt ends at the same position,
# as decoder-only generation expects.
inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    padding_side="left",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # ignore the padding positions
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.bos_token_id,
    )

sequences = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
for sequence in sequences:
    print(sequence)
```
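
Decoded sequences are usually handed to downstream tools as FASTA. The sketch below writes a list of decoded strings to any text handle; `write_fasta`, the `id_prefix` default, and the 60-character line width are illustrative choices rather than part of the ProtGPT3 code. The decoded strings still contain the directional token, so remove it first if the downstream tool expects plain residues.

```python
import io


def write_fasta(sequences, handle, id_prefix="protgpt3_sample", line_width=60):
    """Write sequences as FASTA records and return how many were written.

    Empty sequences are skipped; record numbers follow the input order.
    """
    written = 0
    for index, sequence in enumerate(sequences, start=1):
        residues = "".join(sequence.split())  # drop whitespace between tokens
        if not residues:
            continue
        handle.write(f">{id_prefix}_{index}\n")
        # Wrap the residues at a fixed width, one chunk per line.
        for start in range(0, len(residues), line_width):
            handle.write(residues[start:start + line_width] + "\n")
        written += 1
    return written


# Usage with an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_fasta(["MKTAYIAKQR", "MAVLLG"], buffer)
print(buffer.getvalue(), end="")
```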

## Training Details

### Training Data

ProtGPT3-10B was trained on publicly available protein sequence data from UniRef90 and the GigaRef subset of the Dayhoff Atlas. The 10B-parameter model used approximately 15M UniRef90 sequences and 28M GigaRef sequences, corresponding to approximately 9.8B training tokens.
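
A quick back-of-the-envelope check of these figures, assuming each sequence contributes once to the token count (the card does not say whether sequences were repeated or truncated), gives the total number of sequences and the implied average length:

```python
# Approximate figures quoted in the paragraph above.
uniref90_sequences = 15_000_000
gigaref_sequences = 28_000_000
training_tokens = 9.8e9

total_sequences = uniref90_sequences + gigaref_sequences
mean_tokens_per_sequence = training_tokens / total_sequences
uniref90_fraction = uniref90_sequences / total_sequences

print(f"{total_sequences:,} sequences")                        # 43,000,000 sequences
print(f"~{mean_tokens_per_sequence:.0f} tokens per sequence")  # ~228 tokens per sequence
print(f"UniRef90 share: {uniref90_fraction:.1%}")              # UniRef90 share: 34.9%
```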

### Training Procedure

#### Preprocessing

Protein sequences were sampled from UniRef90 and GigaRef. During training, each sequence was assigned a generation direction, either N-to-C or C-to-N, with a special token prepended to indicate the direction.

#### Training Hyperparameters

- **Training regime:** bfloat16
- **Architecture:** Mixtral-style sparse Mixture-of-Experts causal decoder
- **Maximum sequence length:** 1024
- **Optimizer:** AdamW
- **Learning rate:** 5e-4
- **Weight decay:** 0.1
- **Gradient clipping:** 1.0
- **Batch size:** 500
- **Number of training GPUs:** 4

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on held-out protein sequences with at most 50% sequence identity to the training set. It was also benchmarked on ProteinGym.

#### Metrics

Evaluation included validation perplexity, sequence diversity, predicted pLDDT, proportion of terminating sequences, proportion of low-complexity sequences, and ProteinGym Spearman correlation.
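
Two of these metrics are simple enough to sketch directly. Perplexity is the exponential of the mean per-token negative log-likelihood. For the proportion of terminating sequences, the code below assumes it means the share of generations that emit the end-of-sequence token within the generation budget; that reading, and both function names, are assumptions rather than the authors' evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the true tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def fraction_terminated(generated_token_ids, eos_token_id):
    """Share of generated sequences that contain the end-of-sequence token."""
    if not generated_token_ids:
        return 0.0
    terminated = sum(1 for ids in generated_token_ids if eos_token_id in ids)
    return terminated / len(generated_token_ids)


# A model that assigns probability 0.25 to every true token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
# One of the two generations below contains the (hypothetical) EOS id 2.
print(fraction_terminated([[5, 6, 2], [5, 6, 7]], eos_token_id=2))
```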

### Results

Larger ProtGPT3 single-sequence models showed improved perplexity, sequence quality, and diversity. ProtGPT3-10B is the largest single-sequence model in the family, which ranges from 112M to 10B parameters.

## Technical Specifications

### Model Architecture and Objective

ProtGPT3-10B is a decoder-only causal language model using a Mixtral-style sparse Mixture-of-Experts architecture. It was trained with a causal language modeling objective on protein sequences.
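
To make the sparse Mixture-of-Experts idea concrete, the sketch below shows Mixtral-style top-k routing for a single token in plain Python: a router scores every expert, only the `top_k` highest-scoring experts run, and their outputs are combined with softmax weights renormalised over the chosen experts. The number of experts, `top_k=2`, and the scalar "token" are simplifying assumptions for illustration; the card does not give ProtGPT3's actual expert configuration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights are a softmax over the selected logits, so they sum to 1.
    """
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(
        weight * experts[index](token)
        for index, weight in route_token(router_logits, top_k)
    )


# Toy example: four experts that scale a scalar input by 0, 1, 2, and 3.
toy_experts = [lambda x, k=k: k * x for k in range(4)]
toy_logits = [0.1, 2.0, -1.0, 1.0]
print(route_token(toy_logits))
print(moe_layer(1.0, toy_logits, toy_experts))
```

Because only `top_k` experts run per token, the compute per token is much smaller than the total parameter count suggests, which is the main appeal of this design.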

### Compute Infrastructure

#### Hardware

NVIDIA H100 GPUs.

#### Software

Training used FlashAttention-2, online mini-batch packing, Liger Kernel, and DeepSpeed.

## Citation

**BibTeX:**

```bibtex
@article{protgpt3,
  title={ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models},
  author={Anonymous Authors},
  year={2026}
}
```

## More Information

All models and code are released through the Hugging Face ecosystem and accompanying code repository.

## Model Card Authors

Anonymous authors

## Model Card Contact

More Information Needed