---
language:
- en
tags:
- protein-language-model
- antibody
- immunology
- masked-language-model
- transformer
- roberta
- CDRH3
license: mit
datasets:
- OAS
pipeline_tag: fill-mask
model-index:
- name: H3BERTa
results: []
---
# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis
**Model ID:** `Chrode/H3BERTa`
**Architecture:** RoBERTa-base (encoder-only, Masked Language Model)
**Sequence type:** Heavy chain CDR-H3 regions
**Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
**Max sequence length:** 100 amino acids
**Vocabulary:** 25 tokens (20 standard amino acids + special tokens)
**Mask token:** `[MASK]`
---
The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).
## Model Overview
H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation predictions**.
---
## Intended Use
- Embedding extraction for CDR-H3 repertoire analysis
- Mutation impact scoring (pseudo-likelihood estimation)
- Downstream fine-tuning (e.g., broadly neutralizing antibody (bnAb) identification)
---
## How to Use
**Input format**: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY") without the initial “C” or terminal “W” residues, and without whitespace or separators between amino acids.
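If your sequences come as IMGT junctions (which include the conserved cysteine and tryptophan), they need to be trimmed first. A minimal sketch of such a normalization step; the helper name `junction_to_cdrh3` is illustrative, not part of the H3BERTa package:

```python
def junction_to_cdrh3(junction: str) -> str:
    """Strip the conserved leading C and trailing W from an IMGT junction."""
    s = junction.strip().upper()
    if s.startswith("C"):
        s = s[1:]
    if s.endswith("W"):
        s = s[:-1]
    return s

print(junction_to_cdrh3("CARDRSTGGYFDYW"))  # → ARDRSTGGYFDY
```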
```python
from transformers import AutoTokenizer, AutoModel
model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```
### Example #1: Embeddings extraction
Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.
```python
from transformers import pipeline
import numpy as np
import torch

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1,
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV",
]

with torch.no_grad():
    outs = feat(seqs)

# Each output has shape (1, seq_len, hidden): drop the batch dimension,
# then mean-pool across tokens → one embedding vector per sequence.
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
```
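If you run the model directly on batches of sequences with different lengths, the inputs get padded, and padding positions should be excluded from the mean. A minimal sketch of mask-aware mean pooling on toy NumPy arrays (the shapes here are illustrative, not H3BERTa's actual hidden size):

```python
import numpy as np

# Toy hidden states: batch of 2 sequences, max length 5, hidden size 4.
hidden = np.random.rand(2, 5, 4)
# Attention mask: 1 = real token, 0 = padding (second sequence has 3 tokens).
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]], dtype=float)

# Zero out padding positions, then divide by each sequence's true length.
summed = (hidden * mask[:, :, None]).sum(axis=1)
counts = mask.sum(axis=1, keepdims=True)
embeddings = summed / counts
print(embeddings.shape)  # → (2, 4)
```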
### Example #2: Masked-Language Modeling (Mutation Scoring)
Predict likely amino acids for masked positions or evaluate single-site mutations.
```python
from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)
mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1,
)

# Example: predict a missing residue. Note the input follows the format
# above: no leading "C" or trailing "W".
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)
for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    masked = seq[:idx] + tok.mask_token + seq[idx + 1:]
    preds = mlm(masked, top_k=len(AMINO))
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0  # mutant not among the top predictions

wt = "ARDRSTGGYFDY"
print("R→A @ pos 3:", score_point_mutation(wt, 3, "A"))
```
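The single-site scorer above extends naturally to the pseudo-likelihood estimation mentioned under Intended Use: mask each position in turn, look up the probability of the observed residue, and sum the log-probabilities. A minimal sketch; `pseudo_log_likelihood` is an illustrative helper, not part of the H3BERTa release, and it assumes the `mlm` pipeline and `tok` tokenizer from the example above:

```python
import math

def pseudo_log_likelihood(mlm, mask_token, seq, vocab="ACDEFGHIKLMNPQRSTVWY"):
    """Sum log P(residue_i | context) over all positions, masking one at a time.

    `mlm` is a fill-mask pipeline. Higher (less negative) totals indicate
    sequences the model considers more natural.
    """
    total = 0.0
    for i, wt_aa in enumerate(seq):
        masked = seq[:i] + mask_token + seq[i + 1:]
        preds = mlm(masked, top_k=len(vocab))
        # Fall back to a small floor if the residue is not in the top-k.
        score = next((p["score"] for p in preds if p["token_str"] == wt_aa), 1e-9)
        total += math.log(score)
    return total

# Usage (requires the pipeline loaded above):
# print(pseudo_log_likelihood(mlm, tok.mask_token, "ARDRSTGGYFDY"))
```

Note that this is a pseudo-likelihood, not a true likelihood: each position is scored conditionally on the unmasked rest of the sequence, so totals are best used to rank variants against each other.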
---
# Citation
If you use this model, please cite:
Rodella C. *et al.* H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis. (under review)
---
# License
The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.