---
language: en
license: gpl-3.0
tags:
- immunoinformatics
- antibody
- TCR
- AIRR
- sequence-alignment
- bioinformatics
- pytorch
library_name: alignair
pipeline_tag: token-classification
---

# AlignAIR Pretrained Models

**AlignAIR** is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. It simultaneously predicts V/D/J gene assignments, segment boundaries, mutation rates, and productivity, all in a single forward pass.

## Available Models

| Model | Chain | Germline DB | V Alleles | D Alleles | J Alleles | Size |
|-------|-------|-------------|-----------|-----------|-----------|------|
| `HUMAN_IGH_OGRDB_576` | IGH (Heavy) | OGRDB | 198 | 33 | 7 | 17 MB |
| `HUMAN_IGH_EXTENDED_576` | IGH (Heavy) | Extended | 342 | 37 | 10 | 28 MB |
| `HUMAN_IGK_OGRDB_576` | IGK (Kappa) | OGRDB | 168 | N/A | 8 | 12 MB |
| `HUMAN_IGL_OGRDB_576` | IGL (Lambda) | OGRDB | 181 | N/A | 10 | 13 MB |
| `HUMAN_TCRB_IMGT_576` | TCRB (Beta) | IMGT | 130 | 3 | 14 | 12 MB |

All models use a maximum sequence length of 576 nucleotides and were trained for 1000 epochs on synthetic data generated by [GenAIRR](https://github.com/MuteJester/GenAIRR).

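Because every model takes a fixed 576-nucleotide input, shorter reads have to be padded. The Model Architecture section of this card mentions center-padded tokenization; the following is a minimal sketch of that idea, not AlignAIR's actual tokenizer. The vocabulary `VOCAB` and the function name `center_pad_tokenize` are assumptions made for illustration.

```python
MAX_LEN = 576  # maximum sequence length stated for all pretrained models

# Hypothetical 5-symbol vocabulary (assumption): AlignAIR's real mapping may differ.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4}


def center_pad_tokenize(seq, max_len=MAX_LEN):
    """Return (token_ids, left_offset) with the read centered in max_len slots."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError(f"sequence length {len(seq)} exceeds {max_len}")
    unknown = set(seq) - set("ACGT")
    if unknown:
        # Ambiguous bases such as N are not handled in this sketch.
        raise ValueError(f"unsupported characters: {sorted(unknown)}")
    left = (max_len - len(seq)) // 2       # padding before the read
    right = max_len - len(seq) - left      # remaining padding after the read
    pad = VOCAB["<pad>"]
    ids = [pad] * left + [VOCAB[base] for base in seq] + [pad] * right
    return ids, left
```

Returning the left offset alongside the token ids makes it possible to map positions in the padded input back to coordinates in the original read.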
## Quick Start

```bash
# Quoting the extra keeps shells such as zsh from expanding the brackets
pip install "alignair[hub]"
```

### Python API

```python
from AlignAIR.Models import SingleChainAlignAIR
from AlignAIR.Hub import get_model_path

# Download and load a model (cached automatically)
model_path = get_model_path("igh")  # or "HUMAN_IGH_OGRDB_576"
model = SingleChainAlignAIR.from_pretrained(model_path)
```

### CLI

```bash
# Run inference with a pretrained model
alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/
```

## Benchmark Results (100K synthetic sequences)

| Model | AlignAIR V | IgBLAST V | AlignAIR D | IgBLAST D | AlignAIR J | IgBLAST J | AlignAIR Speed |
|-------|-----------|-----------|-----------|-----------|-----------|-----------|----------------|
| IGH OGRDB | 94.1% | 95.5% | 81.7% | 69.8% | 99.3% | 99.5% | 4,272 seq/s |
| IGH Extended | 92.3% | 93.9% | 88.5% | 82.6% | 98.7% | 98.4% | 4,245 seq/s |
| IGK OGRDB | 94.6% | 95.4% | N/A | N/A | 97.2% | 96.0% | 4,807 seq/s |
| IGL OGRDB | 93.9% | 95.3% | N/A | N/A | 98.4% | 96.7% | 5,384 seq/s |
| TCRB IMGT | 96.5% | 96.2% | 89.6% | 76.3% | 99.6% | 99.1% | 4,317 seq/s |

AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU. The IgBLAST comparison used version 1.22.0 with 8 CPU threads.

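To make the comparison easier to read, the snippet below recomputes the AlignAIR minus IgBLAST difference, in percentage points, from the figures in the benchmark table. The dictionary layout and the helper name `deltas` are only for illustration; no new measurements are introduced.

```python
# (AlignAIR, IgBLAST) values in percent, copied from the benchmark table.
# None marks light chains, which have no D segment.
RESULTS = {
    "IGH OGRDB":    {"V": (94.1, 95.5), "D": (81.7, 69.8), "J": (99.3, 99.5)},
    "IGH Extended": {"V": (92.3, 93.9), "D": (88.5, 82.6), "J": (98.7, 98.4)},
    "IGK OGRDB":    {"V": (94.6, 95.4), "D": None,         "J": (97.2, 96.0)},
    "IGL OGRDB":    {"V": (93.9, 95.3), "D": None,         "J": (98.4, 96.7)},
    "TCRB IMGT":    {"V": (96.5, 96.2), "D": (89.6, 76.3), "J": (99.6, 99.1)},
}


def deltas(results):
    """Return AlignAIR minus IgBLAST per model and gene, in percentage points."""
    out = {}
    for model, genes in results.items():
        out[model] = {
            gene: round(pair[0] - pair[1], 1)
            for gene, pair in genes.items()
            if pair is not None
        }
    return out


for model, diff in deltas(RESULTS).items():
    print(model, diff)
```

On these numbers the gap is largest for D calls (+11.9 points for IGH OGRDB, +13.3 for TCRB IMGT), while V calls stay within 1.6 points of IgBLAST in either direction.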
## Model Architecture

Each model is a `SingleChainAlignAIR` module combining:
- **Nucleotide embedding** (5 → 64 dim) with center-padded tokenization
- **Spatial segmentation** via 9-layer dilated convolutions (receptive field = 1023 nt)
- **Conditioned boundary heads** with chain decoding (v_start → v_end → d_start → ...)
- **Classification heads** for V/D/J allele assignment
- **Analysis heads** for mutation rate and productivity prediction
- **In-model orientation correction** (4-class: forward, reverse-complement, complement, reverse)

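The stated receptive field of 1023 nt is consistent with one common design: nine stacked stride-1 convolutions with kernel size 3 and dilations doubling from 1 to 256. That configuration is an assumption here, since the card gives only the layer count and the resulting field, and `receptive_field` is a hypothetical helper. For stride-1 layers, each layer widens the field by (kernel_size - 1) × dilation positions.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    rf = 1  # a single output position sees itself
    for dilation in dilations:
        rf += (kernel_size - 1) * dilation
    return rf


# Assumed schedule: dilations 1, 2, 4, ..., 256 across the 9 layers.
dilations = [2 ** i for i in range(9)]
print(receptive_field(3, dilations))  # 1023
```

With symmetric kernels, a field of 1023 corresponds to 511 positions of context on each side of a given nucleotide.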
## Bundle Format

Each model directory contains:
- `model.pt`: PyTorch state dict
- `config.json`: Architecture hyperparameters
- `dataconfig.pkl`: Germline allele database (GenAIRR DataConfig)
- `training_meta.json`: Training provenance
- `VERSION`: Bundle format version
- `fingerprint.txt`: SHA-256 integrity hash

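Before loading a downloaded bundle, it can be handy to confirm that all six files listed above are present. The sketch below uses only the standard library. `missing_bundle_files` is a hypothetical helper, not part of the AlignAIR package, and it checks file presence only: it does not validate `fingerprint.txt`, whose exact hashing scheme is not described here.

```python
from pathlib import Path

# File names taken from the bundle listing in this card.
EXPECTED_FILES = (
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
)


def missing_bundle_files(bundle_dir):
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the directory looks complete. A typical call would pass the model directory, for example the path returned by `get_model_path`, assuming that path points at the bundle directory.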
## Citation

If you use AlignAIR in your research, please cite:

> Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. *Nucleic Acids Research*, 53(13). https://doi.org/10.1093/nar/gkaf651

```bibtex
@article{Konstantinovsky2025,
  title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
  volume = {53},
  ISSN = {1362-4962},
  url = {http://dx.doi.org/10.1093/nar/gkaf651},
  DOI = {10.1093/nar/gkaf651},
  number = {13},
  journal = {Nucleic Acids Research},
  publisher = {Oxford University Press (OUP)},
  author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
  year = {2025},
  month = jul
}
```

## License

GPL-3.0. See [LICENSE](https://github.com/MuteJester/AlignAIR/blob/main/LICENSE).

## Links

- [GitHub Repository](https://github.com/MuteJester/AlignAIR)
- [Documentation](https://mutejester.github.io/AlignAIR/)
- [PyPI Package](https://pypi.org/project/AlignAIR/)