# ProtGPT2-Distilled-Medium

A compact protein language model distilled from ProtGPT2 using complementary-regularizer distillation, a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve 31% better perplexity than standard knowledge distillation at 3.8x compression. This is the highest-fidelity model in the family, with a perplexity ratio of just 2.58 relative to the teacher.

**Preprint:** *Distilling Protein Language Models with Complementary Regularizers* (Wijaya, 2026), bioRxiv
**Code:** github.com/ewijaya/protein-lm-distill
## Model Summary
| Property | Value |
|---|---|
| Parameters | ~194M |
| Architecture | GPT-2 (12 layers, 16 heads, 1024 embedding dim) |
| Compression | 3.8x (vs. 738M teacher) |
| Perplexity ratio | 2.58 (31% better than baseline KD) |
| Expected calibration error | 0.135 (20% better than baseline) |
| Inference speedup | 2.4x over ProtGPT2 |
| GPU memory | 836 MB (3.8x reduction from teacher) |
| Throughput | ~50 sequences/min on NVIDIA L40S |
## Quick Start

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

model = GPT2LMHeadModel.from_pretrained("littleworth/protgpt2-distilled-medium")
tokenizer = GPT2Tokenizer.from_pretrained("littleworth/protgpt2-distilled-medium")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

sequences = generator(
    "<|endoftext|>",
    max_length=256,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
    pad_token_id=0,
    truncation=True,
)

# Strip special tokens and newlines to recover plain amino acid strings
for i, seq in enumerate(sequences):
    protein = seq["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
    protein = "".join(c for c in protein if c.isalpha())
    print(f">Generated_{i}\n{protein}")
```
## How It Works

This model was trained using complementary-regularizer distillation, which augments standard temperature-scaled knowledge distillation (Hinton et al., 2015) with two protein-specific enhancements:

1. **Uncertainty-aware position weighting**: uses teacher entropy to emphasize biologically variable regions (loops, surface residues) during distillation, directing learning capacity toward positions where the teacher's distributional knowledge is richest.
2. **Calibration-aware label smoothing**: applies confidence-dependent smoothing to teacher distributions, acting as a noise filter that removes miscalibration artifacts while preserving genuine amino acid substitution preferences.

**The key finding:** each enhancement applied individually degrades distillation quality (+95% and +109% perplexity, respectively), yet their combination yields a 53% perplexity improvement over the baseline, a phenomenon we call complementary regularizers. Smoothing removes the noise that weighting would amplify, while weighting compensates for the signal attenuation that smoothing introduces.
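The interplay of the two regularizers can be sketched as a position-weighted soft cross-entropy. This is an illustrative NumPy toy, not the training code: the entropy-normalized weights and the confidence-scaled smoothing coefficient below are assumptions chosen for demonstration; only T=2.0 and lambda=0.1 come from the training details table.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L = 25, 6  # vocabulary size (amino-acid tokens) and sequence length
teacher_logits = rng.normal(size=(L, V))
student_logits = rng.normal(size=(L, V))

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, lam0 = 2.0, 0.1  # temperature and smoothing strength from the training table
p_t = softmax(teacher_logits, T)  # temperature-scaled teacher distributions
p_s = softmax(student_logits, T)  # temperature-scaled student distributions

# Uncertainty-aware position weights: higher teacher entropy -> higher weight
# (normalization to mean 1 is an assumption for this sketch)
H = -(p_t * np.log(p_t)).sum(axis=-1)
w = H / H.mean()

# Calibration-aware smoothing: blend toward uniform, more where the teacher
# is most confident (confidence-scaled lambda is an assumption)
conf = p_t.max(axis=-1, keepdims=True)
lam = lam0 * conf
p_t_smooth = (1 - lam) * p_t + lam / V

# Position-weighted soft cross-entropy between smoothed teacher and student
loss = -(w * (p_t_smooth * np.log(p_s)).sum(axis=-1)).mean()
print(float(loss))
```

Note how the two pieces interact: smoothing flattens overconfident teacher positions before the entropy weights decide where the student should spend capacity, which is the "denoise, then amplify" reading given above.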
## Performance

### Compared to Baseline Knowledge Distillation
| Method | PPL Ratio | ECE | KL Divergence |
|---|---|---|---|
| Baseline KD | 3.72 | 0.169 | 1.34 |
| This model (complementary regularizers) | 2.58 | 0.135 | 1.47 |
| Improvement | 31% | 20% | --- |
### Model Family Comparison
| Model | Params | Compression | PPL Ratio | Speedup | GPU Memory |
|---|---|---|---|---|---|
| ProtGPT2 (teacher) | 738M | 1x | 1.00 | 1.0x | 3,211 MB |
| Tiny | 37M | 20x | 5.06 | 5.3x | 170 MB |
| Small | 78M | 9.4x | 7.05 | 4.1x | 343 MB |
| Medium (this model) | 194M | 3.8x | 2.58 | 2.4x | 836 MB |
## Biological Validity

Generated sequences produce amino acid distributions closely matching natural proteins (KL divergence from UniProt < 0.015), confirming that the compressed model preserves biologically realistic sequence statistics. ESMFold pLDDT scores match the baseline-distilled model at this scale (38.1 vs. 38.1), indicating that structural quality is preserved as well.
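A check of this kind can be reproduced on your own generations by comparing amino acid frequencies against a reference set. The sketch below is a minimal illustration, not the paper's evaluation script: the toy sequences and add-alpha smoothing are stand-ins; substitute real generated sequences and UniProt background frequencies.

```python
import numpy as np
from collections import Counter

AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_distribution(sequences, alpha=1.0):
    """Amino acid frequencies across sequences, with add-alpha smoothing
    so the KL divergence below is always finite."""
    counts = Counter(c for s in sequences for c in s if c in AAS)
    total = sum(counts.values()) + alpha * len(AAS)
    return np.array([(counts[a] + alpha) / total for a in AAS])

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return float((p * np.log(p / q)).sum())

generated = ["MKTAYIAKQR", "MLSDEDFKAV"]  # stand-ins for model output
reference = ["MKTLLILAVL", "MKFLILLFNL"]  # stand-ins for natural sequences
print(kl_divergence(aa_distribution(generated), aa_distribution(reference)))
```

Values near zero indicate the generated pool matches the reference composition; the model card's reported figure (< 0.015 against UniProt) was computed at much larger sample sizes.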
## When to Use This Model

- **Highest fidelity:** best perplexity ratio (2.58) in the model family, closest to teacher quality
- **Best calibration:** lowest ECE (0.135), important when model confidence guides experimental decisions
- **Quality-critical applications:** when sequence quality matters more than throughput
- **On-premise inference:** 836 MB fits comfortably on standard GPUs, 2.4x faster than the teacher
- **Best fine-tuning quality:** dominates on short peptide families such as conotoxins (PPL 30 vs. the teacher's 54; HMMER hit rate 42.5% vs. 8.0%), with 2x sample efficiency: it matches the teacher's N=100 performance using only N=50 training sequences
For maximum speed, consider the Tiny variant (5.3x speedup, 170 MB). For a balance of speed and quality, consider the Small variant.
## Fine-Tuning on Custom Protein Families

This model is a stronger starting point for domain adaptation than the full-size teacher. On the conotoxin family, it achieves PPL 30 versus the teacher's 54 at N=1,000 and an HMMER hit rate of 42.5% versus 8.0%. On lysozyme, it reaches an 83.5% HMMER hit rate versus the teacher's 69%.

**2x sample efficiency:** at N=50 training sequences, this model outperforms the teacher at N=100 on conotoxins (PPL 372 vs. 1,153), meaning you need half the training data to reach the same quality.

This advantage stems from the complementary-regularizer distillation method itself, not just model compression: a standard-distilled model with the same architecture performs only at teacher level, while complementary-regularizer students far exceed both.
```python
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
)
from datasets import Dataset

model_name = "littleworth/protgpt2-distilled-medium"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Prepare your protein sequences as a list of strings
sequences = ["MKTLLILAVL...", "MKFLILLFNL..."]  # your family sequences
dataset = Dataset.from_dict({"text": sequences})
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
# Hold out a slice for the per-epoch evaluation that early stopping monitors
splits = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./finetuned-model",
        num_train_epochs=20,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
        lr_scheduler_type="cosine",
        warmup_steps=100,
        fp16=True,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
trainer.save_model("./finetuned-model")
```
Recommended fine-tuning hyperparameters for this model:
| Parameter | Value |
|---|---|
| Learning rate | 5e-5 |
| Batch size | 8 |
| Scheduler | Cosine with 100 warmup steps |
| Early stopping | Patience 3 on eval loss |
| Precision | FP16 |
| Gradient checkpointing | Not needed |
## Training Details
| Parameter | Value |
|---|---|
| Teacher model | nferruz/ProtGPT2 (738M) |
| Training data | 10% UniProt subset (Parquet) |
| Temperature (T) | 2.0 |
| Alpha | 0.5 |
| Learning rate | 5e-5 (with 500-step linear warmup) |
| Epochs | 3 |
| Batch size | 32 (effective) |
| Optimizer | AdamW |
| Precision | FP16 |
| Uncertainty weighting | Enabled |
| Calibration smoothing | Enabled (lambda=0.1) |
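For reference, the baseline objective these settings build on is Hinton-style distillation: an alpha-weighted mix of a temperature-scaled soft loss against the teacher and a hard cross-entropy loss against the data. A minimal NumPy sketch using the T=2.0 and alpha=0.5 values from the table; the random logits and label are placeholders, and the T-squared scaling of the soft term follows the standard convention.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 25  # vocabulary size (amino-acid tokens)
teacher_logits = rng.normal(size=V)
student_logits = rng.normal(size=V)
hard_label = 3  # placeholder ground-truth token index

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

T, alpha = 2.0, 0.5  # values from the training details table

p_t = softmax(teacher_logits, T)  # softened teacher distribution
p_s = softmax(student_logits, T)  # softened student distribution

# Soft (distillation) term: cross-entropy against the softened teacher,
# scaled by T^2 to keep gradient magnitudes comparable (Hinton et al., 2015)
soft_loss = -(p_t * np.log(p_s)).sum() * T * T
# Hard term: ordinary cross-entropy against the ground-truth token
hard_loss = -np.log(softmax(student_logits)[hard_label])

loss = alpha * soft_loss + (1 - alpha) * hard_loss
print(float(loss))
```

The complementary regularizers described above modify only the soft term; the alpha-mixing and temperature scaling are shared with the baseline.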
## Citation

```bibtex
@article{Wijaya2026.02.17.706304,
  author = {Wijaya, Edward},
  title = {Distilling Protein Language Models with Complementary Regularizers},
  elocation-id = {2026.02.17.706304},
  year = {2026},
  doi = {10.64898/2026.02.17.706304},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304},
  eprint = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304.full.pdf},
  journal = {bioRxiv}
}
```
## Related Models

- ProtGPT2: the teacher model
- protgpt2-distilled-tiny: 37M parameters, 20x compression
- protgpt2-distilled-small: 78M parameters, 9.4x compression
## License
Apache 2.0