---
language: 
  - en
tags:
  - protein-language-model
  - antibody
  - immunology
  - masked-language-model
  - transformer
  - roberta
  - CDRH3
license: mit
datasets:
  - OAS
pipeline_tag: fill-mask
model-index:
  - name: H3BERTa
    results: []
---

# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

**Model ID:** `Chrode/H3BERTa`  
**Architecture:** RoBERTa-base (encoder-only, Masked Language Model)  
**Sequence type:** Heavy chain CDR-H3 regions  
**Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)    
**Max sequence length:** 100 amino acids  
**Vocabulary:** 25 tokens (20 standard amino acids + special tokens)  
**Mask token:** `[MASK]`

---

The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).

## Model Overview

H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.  
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation predictions**.

---

## Intended Use

- Embedding extraction for CDR-H3 repertoire analysis  
- Mutation impact scoring (pseudo-likelihood estimation)  
- Downstream fine-tuning (e.g., broadly neutralizing antibody (bnAb) identification)

---

## How to Use

**Input format**: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY"), without the flanking initial "C" and terminal "W" residues and without whitespace or separators between amino acids.
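
If desired, inputs can be sanity-checked before tokenization. A minimal sketch (the helper and its regex are our own illustration, not part of the model's API):

```python
import re

# Standard 20-amino-acid alphabet; upper-case only, no separators.
# The check itself is our assumption, not part of the official model card.
_VALID = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")

def check_cdrh3(seq: str) -> str:
    """Raise if `seq` is not a plain upper-case amino acid string."""
    if not _VALID.fullmatch(seq):
        raise ValueError(f"Invalid CDR-H3 string: {seq!r}")
    return seq

check_cdrh3("ARDRSTGGYFDY")  # passes silently
```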

```python
from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```

### Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.
```python
from transformers import pipeline
import torch, numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV"
]

with torch.no_grad():
    outs = feat(seqs)

# Mean-pool across tokens → one fixed-size embedding per sequence
# (the pipeline returns a leading batch dimension of 1 per input, hence [0])
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
```
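
These per-sequence embeddings can then be compared directly, e.g. with cosine similarity for clustering or nearest-neighbour search. A minimal pure-NumPy sketch (the random vectors here are dummy stand-ins for the `embs` extracted above):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy 768-d vectors standing in for real H3BERTa embeddings
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=768), rng.normal(size=768)

print(round(cosine_sim(e1, e1), 3))  # 1.0 for identical vectors
print(round(cosine_sim(e1, e2), 3))  # close to 0 for unrelated vectors
```

Similar CDR-H3 sequences should yield similarity scores well above the random baseline.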

### Example #2: Masked-Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

```python
from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1
)

# Example: predict a missing residue (no flanking C/W, per the input format)
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    # Request extra candidates in case special tokens rank among the top hits
    preds = mlm(masked, top_k=len(AMINO) + 5)
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0  # mutant not among the returned candidates

wt = "ARDRSTGGYFDY"
print("R→A @ pos 3 (0-based):", score_point_mutation(wt, 3, "A"))
```
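
The per-position scores above extend naturally to a whole-sequence pseudo-likelihood, as mentioned under Intended Use: mask each position in turn, take the model's probability of the observed residue, and sum the logs. A minimal sketch of the aggregation step, assuming a `prob_of(seq, idx, aa)` callable (a hypothetical wrapper such as `score_point_mutation` above; the toy uniform model below is only for illustration):

```python
import math

def pseudo_log_likelihood(seq: str, prob_of) -> float:
    """Sum of log P(residue | context), masking one position at a time.

    `prob_of(seq, idx, aa)` is assumed to return the fill-mask probability
    of amino acid `aa` at position `idx` (e.g. `score_point_mutation`).
    """
    return sum(math.log(max(prob_of(seq, i, aa), 1e-12))
               for i, aa in enumerate(seq))

# Toy stand-in model: uniform over the 20 standard amino acids
uniform = lambda seq, i, aa: 1.0 / 20
pll = pseudo_log_likelihood("ARDRSTGGYFDY", uniform)
print(pll)  # 12 positions × log(1/20)
```

Higher (less negative) values indicate sequences the model considers more repertoire-like; comparing the wild-type and mutant pseudo-log-likelihoods gives a context-aware mutation score.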
---

## Citation

If you use this model, please cite:

Rodella C. et al.
*H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis.*
Under review.

---

## License

The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.