Chrode committed
Commit f0a298a · verified · 1 Parent(s): 27a684a

Create README.md

Files changed (1)
  1. README.md +139 -0
README.md ADDED
@@ -0,0 +1,139 @@
---
language:
- en
tags:
- protein-language-model
- antibody
- immunology
- masked-language-model
- transformer
- roberta
- CDRH3
license: mit
datasets:
- OAS
pipeline_tag: fill-mask
model-index:
- name: H3BERTa
  results: []
---

# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

- **Model ID:** `Chrode/H3BERTa`
- **Architecture:** RoBERTa-base (encoder-only, masked language model)
- **Sequence type:** Heavy chain CDR-H3 regions
- **Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
- **Max sequence length:** 100 amino acids
- **Vocabulary:** 25 tokens (20 standard amino acids + special tokens)
- **Mask token:** `[MASK]`

---

## Model Overview

H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation prediction**.

---

## Intended Use

- Embedding extraction for CDR-H3 repertoire analysis
- Mutation impact scoring (pseudo-likelihood estimation)
- Downstream fine-tuning (e.g., identification of broadly neutralizing antibodies, bnAbs)

---

## How to Use

**Input format**: Provide each CDR-H3 as a plain amino acid string (e.g., `ARDRSTGGYFDY`), without the initial "C" or terminal "W" residues and without whitespace or separators between amino acids.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# AutoModel loads the bare encoder (no masked-LM head), which is enough for embeddings
model = AutoModel.from_pretrained(model_id)
```
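
Because the input format is strict, it can help to check sequences before tokenizing them. The sketch below is not part of the model's API: `is_valid_cdrh3` and `trim_junction` are hypothetical helper names, and it assumes that only the 20 standard uppercase residue letters are valid and that the 100-residue limit stated above applies to the raw string.

```python
import re

# Assumption: only the 20 standard amino acids (uppercase) are valid symbols.
_VALID_H3 = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")
MAX_LEN = 100  # maximum sequence length stated on this card


def is_valid_cdrh3(seq: str) -> bool:
    """True if seq is a non-empty run of standard residues within the length limit."""
    return len(seq) <= MAX_LEN and _VALID_H3.fullmatch(seq) is not None


def trim_junction(junction: str) -> str:
    """Drop the flanking C and W from a C...W junction; return other strings unchanged."""
    if len(junction) > 2 and junction.startswith("C") and junction.endswith("W"):
        return junction[1:-1]
    return junction


print(is_valid_cdrh3(trim_junction("CARDRSTGGYFDYW")))
```

Only apply `trim_junction` to sequences known to include the flanking residues; a CDR-H3 that already lacks them but happens to start with C and end with W would be trimmed incorrectly.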

### Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.

```python
from transformers import pipeline
import numpy as np
import torch

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV",
]

with torch.no_grad():
    outs = feat(seqs)

# Mean pooling across tokens (special tokens included) → per-sequence embedding.
# Each output is a nested list that typically carries a leading batch axis of
# size 1, so flatten to (n_tokens, hidden_size) before averaging.
embs = []
for o in outs:
    arr = np.asarray(o)
    embs.append(arr.reshape(-1, arr.shape[-1]).mean(axis=0))
print(len(embs), embs[0].shape)
```
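
For the similarity-search use case mentioned above, pooled embeddings are commonly compared with cosine similarity. The following is a minimal sketch under that assumption, using toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `most_similar` are illustrative names, not functions shipped with the model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return (index, similarity) of the candidate closest to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for real per-sequence embeddings
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
idx, score = most_similar(toy[0], toy[1:])
print(idx, round(score, 3))
```

The functions accept plain lists or 1-D NumPy arrays. For large repertoires, a vectorized NumPy or nearest-neighbor library implementation would likely be preferable.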

### Example #2: Masked-Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

```python
import torch
from transformers import pipeline, AutoTokenizer

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)

mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
)

# Example: predict a missing residue (no flanking C/W, per the input format)
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)

for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    """Probability of mutant_aa at 0-based position idx, given the rest of seq."""
    masked = seq[:idx] + tok.mask_token + seq[idx+1:]
    preds = mlm(masked, top_k=len(AMINO))
    for p in preds:
        if p["token_str"] == mutant_aa:
            return p["score"]
    return 0.0  # mutant_aa was not among the top-k predictions

wt = "ARDRSTGGYFDY"
print("R→A @ index 3 (0-based):", score_point_mutation(wt, 3, "A"))
```
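
The pseudo-likelihood estimation listed under Intended Use can be built on per-position masked probabilities. One common formulation, assumed here and not necessarily the authors' exact procedure, masks each position in turn and sums the log-probabilities of the observed residues. In this sketch `prob_fn` is a hypothetical callable `(seq, idx, aa) -> probability`; `score_point_mutation` from the example above has that shape. The uniform scorer is a toy stand-in so the block runs without the model.

```python
import math
from typing import Callable

# Hypothetical interface: probability of residue `aa` at masked position `idx` of `seq`
ProbFn = Callable[[str, int, str], float]


def pseudo_log_likelihood(seq: str, prob_fn: ProbFn, floor: float = 1e-12) -> float:
    """Sum over positions of log P(observed residue | all other residues)."""
    return sum(math.log(max(prob_fn(seq, i, aa), floor)) for i, aa in enumerate(seq))


def mutation_log_odds(seq: str, idx: int, mutant_aa: str,
                      prob_fn: ProbFn, floor: float = 1e-12) -> float:
    """log P(mutant) - log P(wild type) at one masked position; > 0 favors the mutant."""
    p_mut = max(prob_fn(seq, idx, mutant_aa), floor)
    p_wt = max(prob_fn(seq, idx, seq[idx]), floor)
    return math.log(p_mut) - math.log(p_wt)


def uniform_prob(seq: str, idx: int, aa: str) -> float:
    """Toy scorer: uniform over 20 residues, ignoring context (illustration only)."""
    return 1.0 / 20.0


wt = "ARDRSTGGYFDY"
pll = pseudo_log_likelihood(wt, uniform_prob)
odds = mutation_log_odds(wt, 3, "A", uniform_prob)
print(pll, odds)
```

The `floor` argument guards against `log(0)` when a scorer returns 0.0, as `score_point_mutation` does when the residue is not among the top-k predictions. Each call to a model-backed `prob_fn` runs one forward pass, so scoring a sequence of length n costs n passes.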

---

## Citation

If you use this model, please cite:

Rodella C. et al.
H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis.
Patterns (2025), under review.

---

## License

The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.