---
language: en
license: gpl-3.0
tags:
- immunoinformatics
- antibody
- TCR
- AIRR
- sequence-alignment
- bioinformatics
- pytorch
library_name: alignair
pipeline_tag: token-classification
---
AlignAIR Pretrained Models
AlignAIR is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. It simultaneously predicts V/D/J gene assignments, segment boundaries, mutation rates, and productivity, all in a single forward pass.
Available Models
| Model | Chain | Germline DB | V Alleles | D Alleles | J Alleles | Size |
|---|---|---|---|---|---|---|
| HUMAN_IGH_OGRDB_576 | IGH (Heavy) | OGRDB | 198 | 33 | 7 | 17 MB |
| HUMAN_IGH_EXTENDED_576 | IGH (Heavy) | Extended | 342 | 37 | 10 | 28 MB |
| HUMAN_IGK_OGRDB_576 | IGK (Kappa) | OGRDB | 168 | N/A | 8 | 12 MB |
| HUMAN_IGL_OGRDB_576 | IGL (Lambda) | OGRDB | 181 | N/A | 10 | 13 MB |
| HUMAN_TCRB_IMGT_576 | TCRB (Beta) | IMGT | 130 | 3 | 14 | 12 MB |
All models use a maximum sequence length of 576 nucleotides and were trained for 1,000 epochs on synthetic data generated by GenAIRR.
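To make the 576-nucleotide limit concrete, the sketch below center-pads and tokenizes a read in plain Python. The token ids, the use of 0 as the padding id, and the function name `center_pad_tokenize` are illustrative assumptions, not the AlignAIR tokenizer; only the maximum length of 576 comes from this card.

```python
MAX_LEN = 576  # maximum sequence length stated for all models

# Assumed vocabulary (illustrative only): 0 is padding, 1-4 are the four bases.
PAD_ID = 0
TOKEN_IDS = {"A": 1, "C": 2, "G": 3, "T": 4}


def center_pad_tokenize(seq: str, max_len: int = MAX_LEN) -> tuple[list[int], int]:
    """Return (token ids of length max_len, offset of the first base)."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError(f"sequence has {len(seq)} nt, limit is {max_len}")
    try:
        ids = [TOKEN_IDS[base] for base in seq]
    except KeyError as err:
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None
    total_pad = max_len - len(ids)
    left = total_pad // 2      # any odd leftover goes to the right side
    right = total_pad - left
    return [PAD_ID] * left + ids + [PAD_ID] * right, left


tokens, offset = center_pad_tokenize("ACGTAC")
print(len(tokens), offset)
```

If segment boundaries are reported in padded coordinates, subtracting the returned offset would map them back onto the original read.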
Quick Start
pip install "alignair[hub]"
Python API
from AlignAIR.Models import SingleChainAlignAIR
from AlignAIR.Hub import get_model_path
# Download and load a model (cached automatically)
model_path = get_model_path("igh") # or "HUMAN_IGH_OGRDB_576"
model = SingleChainAlignAIR.from_pretrained(model_path)
CLI
# Run inference with a pretrained model
alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/
Benchmark Results (100K synthetic sequences)
| Model | AlignAIR V | IgBLAST V | AlignAIR D | IgBLAST D | AlignAIR J | IgBLAST J | AlignAIR Speed |
|---|---|---|---|---|---|---|---|
| IGH OGRDB | 94.1% | 95.5% | 81.7% | 69.8% | 99.3% | 99.5% | 4,272 seq/s |
| IGH Extended | 92.3% | 93.9% | 88.5% | 82.6% | 98.7% | 98.4% | 4,245 seq/s |
| IGK OGRDB | 94.6% | 95.4% | N/A | N/A | 97.2% | 96.0% | 4,807 seq/s |
| IGL OGRDB | 93.9% | 95.3% | N/A | N/A | 98.4% | 96.7% | 5,384 seq/s |
| TCRB IMGT | 96.5% | 96.2% | 89.6% | 76.3% | 99.6% | 99.1% | 4,317 seq/s |
AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU; the IgBLAST 1.22.0 comparison runs used 8 CPU threads.
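For a sense of scale, the reported throughput can be converted into time for the 100K-sequence benchmark set. The snippet below only reuses the seq/s figures from the table; it ignores model loading and file I/O, so the numbers are idealized estimates rather than measured run times.

```python
# AlignAIR throughput (sequences per second) copied from the benchmark table.
throughput = {
    "IGH OGRDB": 4272,
    "IGH Extended": 4245,
    "IGK OGRDB": 4807,
    "IGL OGRDB": 5384,
    "TCRB IMGT": 4317,
}

N_SEQUENCES = 100_000  # size of the synthetic benchmark set


def inference_seconds(n_sequences: int, seq_per_second: float) -> float:
    """Idealized inference time: number of sequences divided by throughput."""
    return n_sequences / seq_per_second


for model, rate in throughput.items():
    print(f"{model}: {inference_seconds(N_SEQUENCES, rate):.1f} s")
```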
Model Architecture
Each model is a SingleChainAlignAIR module combining:
- Nucleotide embedding (5→64 dim) with center-padded tokenization
- Spatial segmentation via 9-layer dilated convolutions (receptive field = 1023 nt)
- Conditioned boundary heads with chain decoding (v_start → v_end → d_start → ...)
- Classification heads for V/D/J allele assignment
- Analysis heads for mutation rate and productivity prediction
- In-model orientation correction (4-class: forward, reverse-complement, complement, reverse)
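The receptive field quoted for the segmentation stack can be reproduced with a short calculation. The sketch assumes stride-1 convolutions with kernel size 3 and a dilation that doubles at every layer (1, 2, 4, ..., 256); neither value is stated in this card, they are simply one configuration that yields 1023 nt over 9 layers.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field of stacked stride-1 dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the view by (k - 1) * d
    return rf


# Assumed schedule: dilation doubles per layer across 9 layers.
dilations = [2 ** i for i in range(9)]  # [1, 2, 4, ..., 256]
print(receptive_field(3, dilations))  # 1023
```

A receptive field of 1023 nt reaches 511 positions to each side, so under these assumptions positions near the middle of a 576 nt input can see the entire padded sequence.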
Bundle Format
Each model directory contains:
- model.pt: PyTorch state dict
- config.json: Architecture hyperparameters
- dataconfig.pkl: Germline allele database (GenAIRR DataConfig)
- training_meta.json: Training provenance
- VERSION: Bundle format version
- fingerprint.txt: SHA-256 integrity hash
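A quick sanity check before loading a bundle is to confirm that the files listed above are present and to compute a SHA-256 digest. The helper names below (`missing_bundle_files`, `sha256_of`) are hypothetical, and this card does not say which bytes `fingerprint.txt` covers, so the sketch only computes the digest of a single file rather than reproducing AlignAIR's own integrity check.

```python
import hashlib
from pathlib import Path

# File names taken from the bundle listing above.
EXPECTED_FILES = (
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
)


def missing_bundle_files(bundle_dir: str) -> list[str]:
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For a complete bundle, `missing_bundle_files("HUMAN_IGH_OGRDB_576")` would return an empty list; how the digest from `sha256_of` relates to the stored fingerprint depends on details not given here.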
Citation
If you use AlignAIR in your research, please cite:
Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. Nucleic Acids Research, 53(13). https://doi.org/10.1093/nar/gkaf651
@article{Konstantinovsky2025,
title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
volume = {53},
ISSN = {1362-4962},
url = {http://dx.doi.org/10.1093/nar/gkaf651},
DOI = {10.1093/nar/gkaf651},
number = {13},
journal = {Nucleic Acids Research},
publisher = {Oxford University Press (OUP)},
author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
year = {2025},
month = jul
}
License
GPL-3.0. See LICENSE.