---
language: 
  - en
tags:
  - protein-language-model
  - antibody
  - immunology
  - masked-language-model
  - transformer
  - roberta
  - CDRH3
license: mit
datasets:
  - OAS
pipeline_tag: fill-mask
model-index:
  - name: H3BERTa
    results: []
---

# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

**Model ID:** `Chrode/H3BERTa`  
**Architecture:** RoBERTa-base (encoder-only, Masked Language Model)  
**Sequence type:** Heavy chain CDR-H3 regions  
**Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)    
**Max sequence length:** 100 amino acids  
**Vocabulary:** 25 tokens (20 standard amino acids + special tokens)  
**Mask token:** `[MASK]`

---

The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).

## Model Overview

H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.  
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation predictions**.

---

## Intended Use

- Embedding extraction for CDR-H3 repertoire analysis  
- Mutation impact scoring (pseudo-likelihood estimation)  
- Downstream fine-tuning (e.g., broadly neutralizing antibody (bnAb) identification)

---

## How to Use

**Input format**: CDR-H3 sequences must be provided as plain amino acid strings (e.g., "ARDRSTGGYFDY"), without the flanking initial "C" and terminal "W" residues and without whitespace or separators between amino acids.
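
If desired, inputs can be sanity-checked before tokenization. A minimal sketch (the helper and its regex are our own illustration, not part of the model's API):

```python
import re

# Standard 20-amino-acid alphabet; upper-case only, no separators.
# The check itself is our assumption, not part of the official model card.
_VALID = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")

def check_cdrh3(seq: str) -> str:
    """Raise if `seq` is not a plain upper-case amino acid string."""
    if not _VALID.fullmatch(seq):
        raise ValueError(f"Invalid CDR-H3 string: {seq!r}")
    return seq

check_cdrh3("ARDRSTGGYFDY")  # passes silently
```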

```python
from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```

### Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.
```python
from transformers import pipeline
import torch, numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV"
]

with torch.no_grad():
    outs = feat(seqs)

# Mean-pool across tokens → one fixed-size embedding per sequence
# (the pipeline returns a leading batch dimension of 1 per input, hence [0])
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
```
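
These per-sequence embeddings can then be compared directly, e.g. with cosine similarity for clustering or nearest-neighbour search. A minimal pure-NumPy sketch (the random vectors here are dummy stand-ins for the `embs` extracted above):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy 768-d vectors standing in for real H3BERTa embeddings
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=768), rng.normal(size=768)

print(round(cosine_sim(e1, e1), 3))  # 1.0 for identical vectors
print(round(cosine_sim(e1, e2), 3))  # close to 0 for unrelated vectors
```

Similar CDR-H3 sequences should yield similarity scores well above the random baseline.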

### Example #2: Masked-Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

```python
from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1
)

# Example: predict a missing residue (no flanking C/W, per the input format)
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    # Request extra candidates in case special tokens rank among the top hits
    preds = mlm(masked, top_k=len(AMINO) + 5)
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0  # mutant not among the returned candidates

wt = "ARDRSTGGYFDY"
print("R→A @ pos 3 (0-based):", score_point_mutation(wt, 3, "A"))
```
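
The per-position scores above extend naturally to a whole-sequence pseudo-likelihood, as mentioned under Intended Use: mask each position in turn, take the model's probability of the observed residue, and sum the logs. A minimal sketch of the aggregation step, assuming a `prob_of(seq, idx, aa)` callable (a hypothetical wrapper such as `score_point_mutation` above; the toy uniform model below is only for illustration):

```python
import math

def pseudo_log_likelihood(seq: str, prob_of) -> float:
    """Sum of log P(residue | context), masking one position at a time.

    `prob_of(seq, idx, aa)` is assumed to return the fill-mask probability
    of amino acid `aa` at position `idx` (e.g. `score_point_mutation`).
    """
    return sum(math.log(max(prob_of(seq, i, aa), 1e-12))
               for i, aa in enumerate(seq))

# Toy stand-in model: uniform over the 20 standard amino acids
uniform = lambda seq, i, aa: 1.0 / 20
pll = pseudo_log_likelihood("ARDRSTGGYFDY", uniform)
print(pll)  # 12 positions × log(1/20)
```

Higher (less negative) values indicate sequences the model considers more repertoire-like; comparing the wild-type and mutant pseudo-log-likelihoods gives a context-aware mutation score.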
---

## Citation

If you use this model, please cite:

Rodella C. et al.
*H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis.*
Under review.

---

## License

The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.