---
language:
- en
tags:
- protein-language-model
- antibody
- immunology
- masked-language-model
- transformer
- roberta
- CDRH3
license: mit
datasets:
- OAS
pipeline_tag: fill-mask
model-index:
- name: H3BERTa
  results: []
---
# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

- **Model ID:** `Chrode/H3BERTa`
- **Architecture:** RoBERTa-base (encoder-only, Masked Language Model)
- **Sequence type:** Heavy chain CDR-H3 regions
- **Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
- **Max sequence length:** 100 amino acids
- **Vocabulary:** 25 tokens (20 standard amino acids + special tokens)
- **Mask token:** `[MASK]`

---
The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).

## Model Overview

H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation predictions**.

---
## Intended Use

- Embedding extraction for CDR-H3 repertoire analysis
- Mutation impact scoring (pseudo-likelihood estimation)
- Downstream fine-tuning (e.g., identification of broadly neutralizing antibodies, bnAbs)

---
## How to Use

**Input format**: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY") without the initial “C” or terminal “W” residues, and without whitespace or separators between amino acids.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```
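Inputs often arrive with stray whitespace, lower-case letters, or as full junctions that still carry the flanking "C" and "W". The sketch below shows one way to bring them into the input format described above. `normalize_cdrh3` and its `is_junction` flag are hypothetical names introduced here for illustration; they are not part of H3BERTa or `transformers`. Anchor stripping is opt-in so that the function never guesses whether a leading "C" is an anchor.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def normalize_cdrh3(raw: str, is_junction: bool = False) -> str:
    """Return a clean CDR-H3 string (hypothetical helper, not part of H3BERTa)."""
    seq = "".join(raw.split()).upper()  # drop all whitespace, force upper case
    if is_junction:
        # Assumption: a junction carries the conserved C...W anchors,
        # which the input format above asks us to remove.
        if len(seq) < 3 or seq[0] != "C" or seq[-1] != "W":
            raise ValueError(f"not a C...W junction: {raw!r}")
        seq = seq[1:-1]
    bad = sorted(set(seq) - AMINO_ACIDS)
    if not seq or bad:
        raise ValueError(f"invalid CDR-H3 {raw!r}; unexpected characters: {bad}")
    return seq


# Both calls print ARDRSTGGYFDY
print(normalize_cdrh3("CARDRSTGGYFDYW", is_junction=True))
print(normalize_cdrh3("a r d r s t g g y f d y"))
```

Raising on unexpected characters, rather than silently dropping them, keeps ambiguous residues such as "X" from reaching the tokenizer unnoticed.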
### Example #1: Embeddings extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.

```python
from transformers import pipeline
import torch
import numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1,
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV",
]

with torch.no_grad():
    outs = feat(seqs)

# Each output has shape (1, num_tokens, hidden_size): drop the batch axis,
# then mean-pool across tokens (special tokens included) → per-sequence embedding
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
```
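For the similarity-search use case mentioned above, pooled embeddings are commonly compared with cosine similarity. The following is a minimal, dependency-free sketch. `cosine_similarity` is a helper written for this example, and the two short vectors are placeholders rather than real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for two pooled embeddings (not real model output)
emb_a = [0.2, -0.1, 0.7]
emb_b = [0.1, -0.3, 0.6]
print(round(cosine_similarity(emb_a, emb_b), 4))
```

The function accepts any sequence of numbers, so it should also work on the NumPy arrays in `embs`, for example `cosine_similarity(embs[0], embs[1])`. For large repertoires a vectorized NumPy version would be the more practical choice.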
### Example #2: Masked-Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

```python
import torch
from transformers import pipeline, AutoTokenizer

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1,
)

# Example: predict a missing residue
# (the input follows the format above: no leading "C" or trailing "W")
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    """Return the model probability of `mutant_aa` at 0-based position `idx`."""
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    preds = mlm(masked, top_k=len(AMINO))
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0  # mutant residue not among the top predictions

wt = "ARDRSTGGYFDY"
# Index 3 is 0-based, i.e. the second "R" of the wild type
print("R→A @ pos 3:", score_point_mutation(wt, 3, "A"))
```
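The single-site score above extends naturally to the pseudo-likelihood estimation listed under Intended Use: mask each position in turn and sum the log-probabilities of the true residues. The sketch below illustrates that loop with a toy probability function, so it runs without downloading the model. `pseudo_log_likelihood`, `prob_fn`, and `uniform_prob` are names assumed for this example, and this is not necessarily the scoring procedure used by the authors.

```python
import math
from typing import Callable

MASK_TOKEN = "[MASK]"  # mask token listed in the model card header


def pseudo_log_likelihood(seq: str, prob_fn: Callable[[str, str], float]) -> float:
    """Mask each position in turn and sum log P(true residue | rest)."""
    total = 0.0
    for i, residue in enumerate(seq):
        masked = seq[:i] + MASK_TOKEN + seq[i + 1:]
        # Floor the probability so a 0.0 score does not produce log(0)
        total += math.log(max(prob_fn(masked, residue), 1e-12))
    return total


def uniform_prob(masked_seq: str, residue: str) -> float:
    """Toy stand-in for the model: every residue is equally likely."""
    return 1.0 / 20


pll = pseudo_log_likelihood("ARDRSTGGYFDY", uniform_prob)
print(round(pll, 3))

# With the `mlm` pipeline above, `prob_fn` could be (untested sketch):
# lambda masked, aa: next(
#     (p["score"] for p in mlm(masked, top_k=20) if p["token_str"] == aa), 0.0)
```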
---

## Citation

If you use this model, please cite:

Rodella C. et al.
*H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis.*
Under review.

---
## License

The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.