---
language:
- en
tags:
- protein-language-model
- antibody
- immunology
- masked-language-model
- transformer
- roberta
- CDRH3
license: mit
datasets:
- OAS
pipeline_tag: fill-mask
model-index:
- name: H3BERTa
  results: []
---

# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis

**Model ID:** `Chrode/H3BERTa`
**Architecture:** RoBERTa-base (encoder-only, masked language model)
**Sequence type:** Heavy-chain CDR-H3 regions
**Training:** Pretrained on >17M curated CDR-H3 sequences from healthy-donor repertoires (OAS, IgG/IgA sources)
**Max sequence length:** 100 amino acids
**Vocabulary:** 25 tokens (20 standard amino acids + special tokens)
**Mask token:** `[MASK]`

---

The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).

## Model Overview

H3BERTa is a transformer-based language model trained specifically on the **complementarity-determining region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies. It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation prediction**.

---

## Intended Use

- Embedding extraction for CDR-H3 repertoire analysis
- Mutation impact scoring (pseudo-likelihood estimation)
- Downstream fine-tuning (e.g., bnAb identification)

---

## How to Use

**Input format:** CDR-H3 sequences must be provided as plain amino acid strings (e.g., `ARDRSTGGYFDY`) without the initial "C" or terminal "W" residues, and without whitespace or separators between amino acids.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```

### Example #1: Embedding extraction

Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.
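If your sequences come from annotation tools that report the full IMGT-style junction (conserved leading "C", trailing "W"), they can be trimmed to the expected input format first. A minimal sketch; the helper name is illustrative and not part of the H3BERTa API:

```python
def junction_to_cdrh3(junction: str) -> str:
    """Strip the conserved leading C and trailing W from an IMGT-style
    junction, yielding the plain CDR-H3 string the model expects.
    Illustrative helper; not part of the H3BERTa API."""
    seq = junction.strip().upper()
    if seq.startswith("C"):
        seq = seq[1:]
    if seq.endswith("W"):
        seq = seq[:-1]
    return seq

print(junction_to_cdrh3("CARDRSTGGYFDYW"))  # ARDRSTGGYFDY
```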
```python
from transformers import pipeline
import numpy as np
import torch

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1,
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV",
]

with torch.no_grad():
    outs = feat(seqs)

# Each output has shape (1, num_tokens, hidden_size); drop the batch axis,
# then mean-pool across tokens to get one embedding per sequence.
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)
```

### Example #2: Masked-Language Modeling (Mutation Scoring)

Predict likely amino acids for masked positions or evaluate single-site mutations.

```python
from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)
mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1,
)

# Example: predict a missing residue (note: no leading "C" or trailing "W",
# per the input format above)
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)
for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
AMINO = list("ACDEFGHIKLMNPQRSTVWY")

def score_point_mutation(seq, idx, mutant_aa):
    masked = seq[:idx] + tok.mask_token + seq[idx + 1:]
    preds = mlm(masked, top_k=len(AMINO))
    for p in preds:
        if p["token_str"].strip() == mutant_aa:
            return p["score"]
    return 0.0

wt = "ARDRSTGGYFDY"
print("R→A @ pos 3:", score_point_mutation(wt, 3, "A"))
```

---

# Citation

If you use this model, please cite:

Rodella C. et al. *H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis.* Under review.

---

# License

The model and tokenizer are released under the MIT License. For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.