---
language:
- en
tags:
- protein-language-model
- antibody
- immunology
- masked-language-model
- transformer
- roberta
- CDRH3
license: mit
datasets:
- OAS
pipeline_tag: fill-mask
model-index:
- name: H3BERTa
results: []
---
# H3BERTa: A CDR-H3-specific Language Model for Antibody Repertoire Analysis
**Model ID:** `Chrode/H3BERTa`
**Architecture:** RoBERTa-base (encoder-only, Masked Language Model)
**Sequence type:** Heavy chain CDR-H3 regions
**Training:** Pretrained on >17M curated CDR-H3 sequences from healthy donor repertoires (OAS, IgG/IgA sources)
**Max sequence length:** 100 amino acids
**Vocabulary:** 25 tokens (20 standard amino acids + special tokens)
**Mask token:** `[MASK]`
---
The official GitHub repository is available [here](https://github.com/ibmm-unibe-ch/H3BERTa).
## Model Overview
H3BERTa is a transformer-based language model trained specifically on the **Complementarity-Determining Region 3 of the heavy chain (CDR-H3)**, the most diverse and functionally critical region of antibodies.
It captures the statistical regularities and biophysical constraints underlying natural antibody repertoires, enabling **embedding extraction**, **variant scoring**, and **context-aware mutation predictions**.
---
## Intended Use
- Embedding extraction for CDR-H3 repertoire analysis
- Mutation impact scoring via pseudo-likelihood estimation (see Examples #2 and #3 below)
- Downstream fine-tuning (e.g., bnAb identification)
---
## How to Use
**Input format**: CDR-H3 sequences must be provided as plain amino acid strings (e.g., `"ARDRSTGGYFDY"`), without the initial `C` or terminal `W` residues and without whitespace or separators between amino acids.
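If your sequences still carry the conserved flanking residues (many CDR-H3 annotations retain the leading cysteine and trailing tryptophan), a minimal pre-processing sketch; the helper name `strip_flanking` is ours, not part of the model API:
```python
def strip_flanking(seq: str) -> str:
    """Drop the conserved leading C and trailing W, if present (hypothetical helper)."""
    if seq.startswith("C"):
        seq = seq[1:]
    if seq.endswith("W"):
        seq = seq[:-1]
    return seq

assert strip_flanking("CARDRSTGGYFDYW") == "ARDRSTGGYFDY"
```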
```python
from transformers import AutoTokenizer, AutoModel
model_id = "Chrode/H3BERTa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```
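Given the 25-token vocabulary (one token per standard amino acid plus special tokens), tokenization should be character-level. A quick sanity check; the exact special tokens depend on the tokenizer config:
```python
# Each residue should map to a single token, with special tokens added by the tokenizer
enc = tokenizer("ARDRSTGGYFDY")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Expected: something like ['<s>', 'A', 'R', 'D', ..., 'Y', '</s>'] for a RoBERTa-style tokenizer
```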
### Example #1: Embedding extraction
Extract per-sequence embeddings useful for clustering, similarity search, or downstream ML models.
```python
from transformers import pipeline
import torch
import numpy as np

feat = pipeline(
    task="feature-extraction",
    model="Chrode/H3BERTa",
    tokenizer="Chrode/H3BERTa",
    device=0 if torch.cuda.is_available() else -1,
)

seqs = [
    "ARMGAAREWDFQY",
    "ARDGLGEVAPDYRYGIDV",
]

outs = feat(seqs)  # one nested list of shape (1, num_tokens, hidden_size) per sequence

# Drop the batch dimension, then mean-pool across tokens → per-sequence embedding
embs = [np.array(o)[0].mean(axis=0) for o in outs]
print(len(embs), embs[0].shape)  # 2 (hidden_size,)
```
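The pipeline-based pooling above averages over every returned token, special tokens included. For batched inputs of different lengths, a mask-aware mean over the raw model outputs avoids averaging padding into the embedding. A sketch reusing `tokenizer` and `model` from the loading snippet and `seqs` from Example #1:
```python
model.eval()
batch = tokenizer(seqs, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, hidden_size)

# Zero out padding positions, then average over the real tokens only
mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print(pooled.shape)                                    # torch.Size([2, hidden_size])
```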
### Example #2: Masked-Language Modeling (Mutation Scoring)
Predict likely amino acids for masked positions or evaluate single-site mutations.
```python
from transformers import pipeline, AutoTokenizer
import torch

model_id = "Chrode/H3BERTa"
tok = AutoTokenizer.from_pretrained(model_id)
mlm = pipeline(
    task="fill-mask",
    model=model_id,
    tokenizer=tok,
    device=0 if torch.cuda.is_available() else -1,
)

# Predict a missing residue (no flanking C/W, per the input format above)
seq = "ARDRS[MASK]GGYFDY".replace("[MASK]", tok.mask_token)
preds = mlm(seq, top_k=10)
for p in preds:
    print(p["token_str"], round(p["score"], 4))

# Score a specific point mutation
def score_point_mutation(seq, idx, mutant_aa):
    """Probability of `mutant_aa` at 0-indexed position `idx`, rest of the sequence fixed."""
    masked = seq[:idx] + tok.mask_token + seq[idx + 1:]
    # `targets` restricts scoring to the candidate residue, so the prediction
    # cannot be missed the way a top_k cutoff could miss it
    preds = mlm(masked, targets=[mutant_aa])
    return preds[0]["score"]

wt = "ARDRSTGGYFDY"
print("R→A @ pos 3:", score_point_mutation(wt, 3, "A"))
```
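### Example #3: Pseudo-log-likelihood (sequence-level scoring)
The pseudo-likelihood estimation mentioned under Intended Use can be approximated by masking one position at a time and summing the log-probabilities of the observed residues. This is a minimal sketch of that common MLM convention, reusing `mlm` and `tok` from Example #2; the authors' exact scoring protocol may differ.
```python
import math

def pseudo_log_likelihood(seq):
    """Sum of log P(observed residue | rest of sequence), one masked position at a time."""
    total = 0.0
    for i, aa in enumerate(seq):
        masked = seq[:i] + tok.mask_token + seq[i + 1:]
        preds = mlm(masked, targets=[aa])  # probability of the wild-type residue
        total += math.log(preds[0]["score"])
    return total

print(pseudo_log_likelihood("ARDRSTGGYFDY"))
```
Higher (less negative) scores indicate sequences the model finds more repertoire-like; comparing wild-type and mutant pseudo-log-likelihoods gives a simple mutation impact score.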
---
## Citation
If you use this model, please cite:

Rodella C. et al. *H3BERTa: A CDR-H3-specific language model for antibody repertoire analysis* (under review).
---
## License
The model and tokenizer are released under the MIT License.
For commercial or large-scale applications, please contact the authors to discuss licensing or collaboration.