	---
	language: en
	license: gpl-3.0
	tags:
	- immunoinformatics
	- antibody
	- TCR
	- AIRR
	- sequence-alignment
	- bioinformatics
	- pytorch
	library_name: alignair
	pipeline_tag: token-classification
	---

	# AlignAIR Pretrained Models

	AlignAIR is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. In a single forward pass, it predicts V/D/J gene assignments, segment boundaries, mutation rates, and productivity.

	## Available Models

	| Model | Chain | Germline DB | V Alleles | D Alleles | J Alleles | Size |
	|-------|-------|-------------|-----------|-----------|-----------|------|
	| `HUMAN_IGH_OGRDB_576` | IGH (Heavy) | OGRDB | 198 | 33 | 7 | 17 MB |
	| `HUMAN_IGH_EXTENDED_576` | IGH (Heavy) | Extended | 342 | 37 | 10 | 28 MB |
	| `HUMAN_IGK_OGRDB_576` | IGK (Kappa) | OGRDB | 168 | n/a | 8 | 12 MB |
	| `HUMAN_IGL_OGRDB_576` | IGL (Lambda) | OGRDB | 181 | n/a | 10 | 13 MB |
	| `HUMAN_TCRB_IMGT_576` | TCRB (Beta) | IMGT | 130 | 3 | 14 | 12 MB |

	All models use a maximum sequence length of 576 nucleotides and were trained for 1000 epochs on synthetic data generated by [GenAIRR](https://github.com/MuteJester/GenAIRR).
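
To make the 576-nucleotide limit concrete, here is a minimal sketch of center-padded tokenization, the input scheme named under Model Architecture. The vocabulary mapping, the pad id, and the helper name `center_pad_tokenize` are assumptions made for illustration; AlignAIR's own tokenizer may use a different mapping and performs this step internally.

```python
MAX_LEN = 576  # maximum sequence length stated for all five models

# Hypothetical 5-symbol vocabulary (an assumption, not AlignAIR's actual mapping):
# ids 1-4 for A/C/G/T, id 0 shared by padding and ambiguous bases such as N.
VOCAB = {"A": 1, "C": 2, "G": 3, "T": 4}
PAD_ID = 0


def center_pad_tokenize(seq: str, max_len: int = MAX_LEN) -> tuple[list[int], int]:
    """Return (token ids of length max_len, left offset of the sequence)."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError(f"sequence length {len(seq)} exceeds {max_len}")
    tokens = [VOCAB.get(base, PAD_ID) for base in seq]
    total_pad = max_len - len(tokens)
    left = total_pad // 2      # if the padding is odd, the extra slot goes right
    right = total_pad - left
    return [PAD_ID] * left + tokens + [PAD_ID] * right, left


ids, offset = center_pad_tokenize("ACGTN")
print(len(ids), offset)  # 576 285
```

Keeping the left offset around is useful in a scheme like this: any position predicted in padded coordinates can be mapped back to the original read by subtracting it.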

	## Quick Start

	```bash
	pip install "alignair[hub]"
	```

	### Python API

	```python
	from AlignAIR.Models import SingleChainAlignAIR
	from AlignAIR.Hub import get_model_path

	# Download and load a model (cached automatically)
	model_path = get_model_path("igh") # or "HUMAN_IGH_OGRDB_576"
	model = SingleChainAlignAIR.from_pretrained(model_path)
	```

	### CLI

	```bash
	# Run inference with a pretrained model
	alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/
	```

	## Benchmark Results (100K synthetic sequences)

	| Model | AlignAIR V | IgBLAST V | AlignAIR D | IgBLAST D | AlignAIR J | IgBLAST J | AlignAIR Speed |
	|-------|-----------|-----------|-----------|-----------|-----------|-----------|----------------|
	| IGH OGRDB | 94.1% | 95.5% | 81.7% | 69.8% | 99.3% | 99.5% | 4,272 seq/s |
	| IGH Extended | 92.3% | 93.9% | 88.5% | 82.6% | 98.7% | 98.4% | 4,245 seq/s |
	| IGK OGRDB | 94.6% | 95.4% | n/a | n/a | 97.2% | 96.0% | 4,807 seq/s |
	| IGL OGRDB | 93.9% | 95.3% | n/a | n/a | 98.4% | 96.7% | 5,384 seq/s |
	| TCRB IMGT | 96.5% | 96.2% | 89.6% | 76.3% | 99.6% | 99.1% | 4,317 seq/s |

	AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU; the IgBLAST comparison used IgBLAST 1.22.0 with 8 CPU threads.
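
As a rough reading aid for the benchmark table, the throughput column can be converted into an estimated wall-clock time for the 100K-sequence benchmark, and the two D columns into a percentage-point difference. The numbers below are copied from the table; the dictionary layout and the `summarize` helper are only illustrative, and the time estimate ignores model loading and file I/O.

```python
# (AlignAIR D %, IgBLAST D %, AlignAIR seq/s), copied from the benchmark table.
# D values are None for the light chains (IGK, IGL), which have no D gene.
BENCHMARK = {
    "IGH OGRDB": (81.7, 69.8, 4272),
    "IGH Extended": (88.5, 82.6, 4245),
    "IGK OGRDB": (None, None, 4807),
    "IGL OGRDB": (None, None, 5384),
    "TCRB IMGT": (89.6, 76.3, 4317),
}
N_SEQUENCES = 100_000  # size of the synthetic benchmark set


def summarize(benchmark=BENCHMARK, n_sequences=N_SEQUENCES):
    """Map model name -> (estimated seconds, D difference in percentage points)."""
    rows = {}
    for model, (d_alignair, d_igblast, speed) in benchmark.items():
        seconds = round(n_sequences / speed, 1)
        d_gap = None if d_alignair is None else round(d_alignair - d_igblast, 1)
        rows[model] = (seconds, d_gap)
    return rows


for model, (seconds, d_gap) in summarize().items():
    gap = "n/a" if d_gap is None else f"{d_gap:+.1f} pp"
    print(f"{model:<13} ~{seconds:5.1f} s per 100K sequences, D difference: {gap}")
```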

	## Model Architecture

	Each model is a `SingleChainAlignAIR` module combining:
	- Nucleotide embedding (5→64 dim) with center-padded tokenization
	- Spatial segmentation via 9-layer dilated convolutions (receptive field = 1023 nt)
	- Conditioned boundary heads with chain decoding (v_start → v_end → d_start → ...)
	- Classification heads for V/D/J allele assignment
	- Analysis heads for mutation rate and productivity prediction
	- In-model orientation correction (4-class: forward, reverse-complement, complement, reverse)
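
Two items in the list above can be checked with a few lines of Python. The receptive field of 1023 nt is consistent with nine stacked convolutions of kernel size 3 whose dilation doubles at each layer (1, 2, 4, ..., 256); the kernel size and dilation schedule are assumptions here, since the list states only the layer count and the resulting field. The second part spells out the four orientation classes as plain string transforms. This is a sketch of the concepts, not AlignAIR's implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# The four orientation classes, written as transforms of a forward read.
ORIENTATIONS = {
    "forward": lambda s: s,
    "reverse-complement": lambda s: s.translate(COMPLEMENT)[::-1],
    "complement": lambda s: s.translate(COMPLEMENT),
    "reverse": lambda s: s[::-1],
}

# Assumed schedule: kernel size 3, dilations 1, 2, 4, ..., 256 (nine layers).
dilations = [2 ** i for i in range(9)]
print(receptive_field(3, dilations))  # 1023

seq = "ATGCCG"
for name, transform in ORIENTATIONS.items():
    observed = transform(seq)        # how the read would appear in this orientation
    restored = transform(observed)   # applying the same transform again undoes it
    print(f"{name:<19} {observed} -> {restored}")
```

Each of the four transforms is its own inverse, so once an orientation class is known, applying the matching transform a second time recovers the forward sequence.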

	## Bundle Format

	Each model directory contains:
	- `model.pt` — PyTorch state dict
	- `config.json` — Architecture hyperparameters
	- `dataconfig.pkl` — Germline allele database (GenAIRR DataConfig)
	- `training_meta.json` — Training provenance
	- `VERSION` — Bundle format version
	- `fingerprint.txt` — SHA-256 integrity hash
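
A downloaded bundle can be sanity-checked against the file list above with the standard library alone. The sketch below reports which of the six expected files are missing and computes a SHA-256 digest of a single file. How `fingerprint.txt` is derived (which files it covers and in what order) is not specified here, so the digest helper is a generic building block rather than a reproduction of AlignAIR's integrity check; the function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

# File names taken from the bundle format list.
EXPECTED_FILES = (
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
)


def missing_bundle_files(bundle_dir) -> list[str]:
    """Return the expected bundle files that are absent from bundle_dir."""
    bundle = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (bundle / name).is_file()]


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of one file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway directory that mimics an incomplete bundle.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "model.pt").write_bytes(b"abc")
    (demo / "config.json").write_text("{}")
    demo_missing = missing_bundle_files(demo)
    demo_digest = sha256_of_file(demo / "model.pt")

print(demo_missing)
print(demo_digest)
```

In practice, `missing_bundle_files` would be pointed at a real model directory, such as the path returned by `get_model_path` in the Quick Start.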

	## Citation

	If you use AlignAIR in your research, please cite:

	> Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. Nucleic Acids Research, 53(13). https://doi.org/10.1093/nar/gkaf651

	```bibtex
	@article{Konstantinovsky2025,
	title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
	volume = {53},
	ISSN = {1362-4962},
	url = {http://dx.doi.org/10.1093/nar/gkaf651},
	DOI = {10.1093/nar/gkaf651},
	number = {13},
	journal = {Nucleic Acids Research},
	publisher = {Oxford University Press (OUP)},
	author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
	year = {2025},
	month = jul
	}
	```

	## License

	GPL-3.0. See [LICENSE](https://github.com/MuteJester/AlignAIR/blob/main/LICENSE).

	## Links

	- [GitHub Repository](https://github.com/MuteJester/AlignAIR)
	- [Documentation](https://mutejester.github.io/AlignAIR/)
	- [PyPI Package](https://pypi.org/project/AlignAIR/)