---
language: en
license: gpl-3.0
tags:
- immunoinformatics
- antibody
- TCR
- AIRR
- sequence-alignment
- bioinformatics
- pytorch
library_name: alignair
pipeline_tag: token-classification
---
# AlignAIR Pretrained Models
**AlignAIR** is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. It predicts V/D/J gene assignments, segment boundaries, mutation rates, and productivity in a single forward pass.
## Available Models
| Model | Chain | Germline DB | V Alleles | D Alleles | J Alleles | Size |
|-------|-------|-------------|-----------|-----------|-----------|------|
| `HUMAN_IGH_OGRDB_576` | IGH (Heavy) | OGRDB | 198 | 33 | 7 | 17 MB |
| `HUMAN_IGH_EXTENDED_576` | IGH (Heavy) | Extended | 342 | 37 | 10 | 28 MB |
| `HUMAN_IGK_OGRDB_576` | IGK (Kappa) | OGRDB | 168 | n/a | 8 | 12 MB |
| `HUMAN_IGL_OGRDB_576` | IGL (Lambda) | OGRDB | 181 | n/a | 10 | 13 MB |
| `HUMAN_TCRB_IMGT_576` | TCRB (Beta) | IMGT | 130 | 3 | 14 | 12 MB |

All models use a maximum sequence length of 576 nucleotides and were trained for 1,000 epochs on synthetic data generated by [GenAIRR](https://github.com/MuteJester/GenAIRR).
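
Because every model accepts at most 576 nucleotides, it can help to screen inputs before inference. The sketch below is a hypothetical pre-filter and is not part of the AlignAIR API; `split_by_length` is an illustrative name, and how AlignAIR itself treats longer reads is not stated here.

```python
MAX_LEN = 576  # maximum sequence length stated for all models


def split_by_length(sequences, max_len=MAX_LEN):
    """Partition sequences into those that fit the model input and those that do not."""
    fits, too_long = [], []
    for seq in sequences:
        # Route each sequence to the matching bucket, preserving input order.
        (fits if len(seq) <= max_len else too_long).append(seq)
    return fits, too_long


if __name__ == "__main__":
    reads = ["ACGT" * 100, "ACGT" * 150]  # 400 nt and 600 nt
    fits, too_long = split_by_length(reads)
    print(f"{len(fits)} within limit, {len(too_long)} over limit")
```

Reads in the second bucket could be trimmed or reviewed manually before being passed on.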
## Quick Start
```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "alignair[hub]"
```
### Python API
```python
from AlignAIR.Models import SingleChainAlignAIR
from AlignAIR.Hub import get_model_path
# Download and load a model (cached automatically)
model_path = get_model_path("igh") # or "HUMAN_IGH_OGRDB_576"
model = SingleChainAlignAIR.from_pretrained(model_path)
```
### CLI
```bash
# Run inference with a pretrained model
alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/
```
## Benchmark Results (100K synthetic sequences)
| Model | AlignAIR V | IgBLAST V | AlignAIR D | IgBLAST D | AlignAIR J | IgBLAST J | AlignAIR Speed |
|-------|-----------|-----------|-----------|-----------|-----------|-----------|----------------|
| IGH OGRDB | 94.1% | 95.5% | 81.7% | 69.8% | 99.3% | 99.5% | 4,272 seq/s |
| IGH Extended | 92.3% | 93.9% | 88.5% | 82.6% | 98.7% | 98.4% | 4,245 seq/s |
| IGK OGRDB | 94.6% | 95.4% | n/a | n/a | 97.2% | 96.0% | 4,807 seq/s |
| IGL OGRDB | 93.9% | 95.3% | n/a | n/a | 98.4% | 96.7% | 5,384 seq/s |
| TCRB IMGT | 96.5% | 96.2% | 89.6% | 76.3% | 99.6% | 99.1% | 4,317 seq/s |

AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU; IgBLAST 1.22.0 was run with 8 CPU threads.
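
To read the benchmark table at a glance, the per-gene gap between the two tools can be computed directly. The snippet below is only an illustrative calculation over the accuracies listed in the table; the dictionary layout and the `accuracy_deltas` helper are assumptions made for this example.

```python
# Accuracy (%) copied from the benchmark table as (AlignAIR, IgBLAST).
# None marks chains without a D gene.
BENCHMARK = {
    "IGH OGRDB":    {"V": (94.1, 95.5), "D": (81.7, 69.8), "J": (99.3, 99.5)},
    "IGH Extended": {"V": (92.3, 93.9), "D": (88.5, 82.6), "J": (98.7, 98.4)},
    "IGK OGRDB":    {"V": (94.6, 95.4), "D": None,         "J": (97.2, 96.0)},
    "IGL OGRDB":    {"V": (93.9, 95.3), "D": None,         "J": (98.4, 96.7)},
    "TCRB IMGT":    {"V": (96.5, 96.2), "D": (89.6, 76.3), "J": (99.6, 99.1)},
}


def accuracy_deltas(table):
    """Return AlignAIR minus IgBLAST accuracy, in percentage points, per gene."""
    return {
        model: {
            gene: round(pair[0] - pair[1], 1)
            for gene, pair in genes.items()
            if pair is not None
        }
        for model, genes in table.items()
    }


if __name__ == "__main__":
    for model, delta in accuracy_deltas(BENCHMARK).items():
        print(model, delta)
```

Positive values favour AlignAIR. On these numbers the largest gaps are in D calls (5.9 to 13.3 points in AlignAIR's favour), while IgBLAST is ahead on V calls for the four IG models by 0.8 to 1.6 points.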
## Model Architecture
Each model is a `SingleChainAlignAIR` module combining:
- **Nucleotide embedding** (5 → 64 dim) with center-padded tokenization
- **Spatial segmentation** via 9-layer dilated convolutions (receptive field = 1023 nt)
- **Conditioned boundary heads** with chain decoding (v_start → v_end → d_start → ...)
- **Classification heads** for V/D/J allele assignment
- **Analysis heads** for mutation rate and productivity prediction
- **In-model orientation correction** (4-class: forward, reverse-complement, complement, reverse)
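
A few of the components listed above can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the AlignAIR implementation: the kernel size of 3 and per-layer dilation doubling are assumed because they reproduce the stated 1023 nt receptive field, the token IDs are hypothetical, and all function names are invented for this example.

```python
MAX_LEN = 576  # maximum sequence length stated for all models


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumption: kernel size 3 with dilations 1, 2, 4, ..., 256 across 9 layers.
NINE_LAYER_DILATIONS = [2 ** i for i in range(9)]

# Assumption: a 5-token vocabulary where 0 is shared by padding and ambiguous bases.
TOKEN_IDS = {"A": 1, "C": 2, "G": 3, "T": 4}


def center_pad_tokenize(seq, max_len=MAX_LEN):
    """Map bases to integer IDs and pad equally on both sides up to max_len."""
    if len(seq) > max_len:
        raise ValueError(f"sequence length {len(seq)} exceeds {max_len}")
    ids = [TOKEN_IDS.get(base, 0) for base in seq.upper()]
    left = (max_len - len(ids)) // 2
    right = max_len - len(ids) - left
    # The left offset is returned so positions can be mapped back to the read.
    return [0] * left + ids + [0] * right, left


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# The four orientation classes named above. Each transform is its own inverse,
# so applying the predicted class again restores the forward sequence.
ORIENTATIONS = {
    "forward": lambda s: s,
    "reverse_complement": lambda s: s.translate(_COMPLEMENT)[::-1],
    "complement": lambda s: s.translate(_COMPLEMENT),
    "reverse": lambda s: s[::-1],
}


def to_forward(seq, predicted_orientation):
    """Undo the predicted orientation of an uppercase nucleotide string."""
    return ORIENTATIONS[predicted_orientation](seq)


if __name__ == "__main__":
    print(receptive_field(3, NINE_LAYER_DILATIONS))  # 1023
    ids, offset = center_pad_tokenize("ACGTACGT")
    print(len(ids), offset)
    print(to_forward("CGTT", "reverse_complement"))
```

Under these assumptions the receptive field is 1 + 2 × (1 + 2 + ... + 256) = 1023 nt, which exceeds the 576 nt input, so each position can in principle see the whole padded sequence.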
## Bundle Format
Each model directory contains:
- `model.pt`: PyTorch state dict
- `config.json`: Architecture hyperparameters
- `dataconfig.pkl`: Germline allele database (GenAIRR DataConfig)
- `training_meta.json`: Training provenance
- `VERSION`: Bundle format version
- `fingerprint.txt`: SHA-256 integrity hash
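
A downloaded bundle can be sanity-checked against the file list above before loading. The helpers below are a hypothetical sketch using only the standard library. Which bytes `fingerprint.txt` covers is not stated here, so the example stops at computing a SHA-256 digest and does not compare it to the stored fingerprint.

```python
import hashlib
from pathlib import Path

# File names taken from the bundle listing.
BUNDLE_FILES = (
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
)


def missing_bundle_files(bundle_dir):
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in BUNDLE_FILES if not (root / name).is_file()]


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    bundle = Path("HUMAN_IGH_OGRDB_576")  # example directory name
    missing = missing_bundle_files(bundle)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("model.pt SHA-256:", sha256_of(bundle / "model.pt"))
```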
## Citation
If you use AlignAIR in your research, please cite:
> Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. *Nucleic Acids Research*, 53(13). https://doi.org/10.1093/nar/gkaf651
```bibtex
@article{Konstantinovsky2025,
title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
volume = {53},
ISSN = {1362-4962},
url = {http://dx.doi.org/10.1093/nar/gkaf651},
DOI = {10.1093/nar/gkaf651},
number = {13},
journal = {Nucleic Acids Research},
publisher = {Oxford University Press (OUP)},
author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
year = {2025},
month = jul
}
```
## License
GPL-3.0. See [LICENSE](https://github.com/MuteJester/AlignAIR/blob/main/LICENSE).
## Links
- [GitHub Repository](https://github.com/MuteJester/AlignAIR)
- [Documentation](https://mutejester.github.io/AlignAIR/)
- [PyPI Package](https://pypi.org/project/AlignAIR/)