language: en
license: gpl-3.0
tags:
  - immunoinformatics
  - antibody
  - TCR
  - AIRR
  - sequence-alignment
  - bioinformatics
  - pytorch
library_name: alignair
pipeline_tag: token-classification

AlignAIR Pretrained Models

AlignAIR is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. In a single forward pass it predicts V/D/J gene assignments, segment boundaries, mutation rate, and productivity.

Available Models

Model	Chain	Germline DB	V Alleles	D Alleles	J Alleles	Size
`HUMAN_IGH_OGRDB_576`	IGH (Heavy)	OGRDB	198	33	7	17 MB
`HUMAN_IGH_EXTENDED_576`	IGH (Heavy)	Extended	342	37	10	28 MB
`HUMAN_IGK_OGRDB_576`	IGK (Kappa)	OGRDB	168	—	8	12 MB
`HUMAN_IGL_OGRDB_576`	IGL (Lambda)	OGRDB	181	—	10	13 MB
`HUMAN_TCRB_IMGT_576`	TCRB (Beta)	IMGT	130	3	14	12 MB

All models use a maximum sequence length of 576 nucleotides and were trained for 1,000 epochs on synthetic data generated by GenAIRR.
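
Because every model has a fixed 576-nucleotide input window, it can help to screen reads by length before inference. The sketch below is not part of the AlignAIR API: `MAX_LEN` and `split_by_length` are illustrative names, and how AlignAIR itself treats over-length reads is not described here.

```python
MAX_LEN = 576  # maximum sequence length stated for all five models


def split_by_length(sequences, max_len=MAX_LEN):
    """Separate reads that fit the model window from those that exceed it."""
    fits, too_long = [], []
    for seq in sequences:
        (fits if len(seq) <= max_len else too_long).append(seq)
    return fits, too_long


reads = ["ACGT" * 100, "ACGT" * 150]  # 400 nt and 600 nt
fits, too_long = split_by_length(reads)
print(len(fits), "fit;", len(too_long), "exceed", MAX_LEN, "nt")
```

Reads in `too_long` would need trimming or separate handling; which is appropriate depends on the data and is left open here.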

Quick Start

pip install "alignair[hub]"

Python API

from AlignAIR.Models import SingleChainAlignAIR
from AlignAIR.Hub import get_model_path

# Download and load a model (cached automatically)
model_path = get_model_path("igh")  # or "HUMAN_IGH_OGRDB_576"
model = SingleChainAlignAIR.from_pretrained(model_path)

CLI

# Run inference with a pretrained model
alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/

Benchmark Results (100K synthetic sequences)

Model	AlignAIR V	IgBLAST V	AlignAIR D	IgBLAST D	AlignAIR J	IgBLAST J	AlignAIR Speed
IGH OGRDB	94.1%	95.5%	81.7%	69.8%	99.3%	99.5%	4,272 seq/s
IGH Extended	92.3%	93.9%	88.5%	82.6%	98.7%	98.4%	4,245 seq/s
IGK OGRDB	94.6%	95.4%	—	—	97.2%	96.0%	4,807 seq/s
IGL OGRDB	93.9%	95.3%	—	—	98.4%	96.7%	5,384 seq/s
TCRB IMGT	96.5%	96.2%	89.6%	76.3%	99.6%	99.1%	4,317 seq/s

AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU; IgBLAST 1.22.0 was run with 8 CPU threads.
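
To put the throughput column in wall-clock terms, the arithmetic below divides the 100K benchmark size by each reported speed. It uses only figures from the table; `seconds_for` is an illustrative helper, and the result covers inference at the reported rate only, since model loading and file I/O are not reported.

```python
# AlignAIR throughput (sequences per second) copied from the benchmark table.
SPEED = {
    "IGH OGRDB": 4272,
    "IGH Extended": 4245,
    "IGK OGRDB": 4807,
    "IGL OGRDB": 5384,
    "TCRB IMGT": 4317,
}


def seconds_for(n_sequences, seq_per_s):
    """Time in seconds to process n_sequences at a constant rate."""
    return n_sequences / seq_per_s


for model, rate in SPEED.items():
    print(f"{model}: {seconds_for(100_000, rate):.1f} s per 100K sequences")
```

At these rates each 100K-sequence benchmark set takes roughly 19 to 24 seconds of GPU inference.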

Model Architecture

Each model is a SingleChainAlignAIR module combining:

Nucleotide embedding (5→64 dim) with center-padded tokenization
Spatial segmentation via 9-layer dilated convolutions (receptive field = 1023 nt)
Conditioned boundary heads with chain decoding (v_start → v_end → d_start → ...)
Classification heads for V/D/J allele assignment
Analysis heads for mutation rate and productivity prediction
In-model orientation correction (4-class: forward, reverse-complement, complement, reverse)
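
Two items in the list above can be checked with a few lines of Python. The sketch below is not taken from the AlignAIR source. The first part assumes a kernel size of 3 with dilation doubling at each of the 9 layers (1, 2, ..., 256), which is one configuration that yields the stated 1023 nt receptive field; the real kernel size and dilation schedule are not given here. The second part spells out the four orientation classes and shows that each one is undone by applying the same transform again. `receptive_field`, `ORIENTATIONS`, and `restore_forward` are illustrative names.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# ASSUMPTION: kernel size 3, dilation doubling per layer (1, 2, 4, ..., 256).
dilations = [2 ** i for i in range(9)]
print(receptive_field(3, dilations))  # 1023

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# The four orientation classes; every transform is its own inverse.
ORIENTATIONS = {
    "forward": lambda s: s,
    "reverse-complement": lambda s: s.translate(COMPLEMENT)[::-1],
    "complement": lambda s: s.translate(COMPLEMENT),
    "reverse": lambda s: s[::-1],
}


def restore_forward(seq, predicted_orientation):
    """Undo a predicted orientation by applying the same transform again."""
    return ORIENTATIONS[predicted_orientation](seq)


print(restore_forward("CGTT", "reverse-complement"))  # AACG
```

Under the assumed schedule the receptive field (1023 nt) exceeds the 576 nt input window, so each output position can in principle see the whole sequence.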

Bundle Format

Each model directory contains:

model.pt — PyTorch state dict
config.json — Architecture hyperparameters
dataconfig.pkl — Germline allele database (GenAIRR DataConfig)
training_meta.json — Training provenance
VERSION — Bundle format version
fingerprint.txt — SHA-256 integrity hash
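
A downloaded bundle can be sanity-checked against the file list above with the standard library alone. This is a sketch, not an AlignAIR utility: `missing_bundle_files` and `sha256_hex` are illustrative helpers. Which file or files the hash in fingerprint.txt covers is not stated here, so the code only computes a digest and makes no claim to reproduce the stored fingerprint.

```python
import hashlib
from pathlib import Path

# File names listed in the bundle description above.
BUNDLE_FILES = [
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
]


def missing_bundle_files(model_dir):
    """Return the expected bundle files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in BUNDLE_FILES if not (root / name).is_file()]


def sha256_hex(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `missing_bundle_files(model_path)` returning an empty list indicates that all six files are present in the directory.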

Citation

If you use AlignAIR in your research, please cite:

Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. Nucleic Acids Research, 53(13). https://doi.org/10.1093/nar/gkaf651

@article{Konstantinovsky2025,
  title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
  volume = {53},
  ISSN = {1362-4962},
  url = {http://dx.doi.org/10.1093/nar/gkaf651},
  DOI = {10.1093/nar/gkaf651},
  number = {13},
  journal = {Nucleic Acids Research},
  publisher = {Oxford University Press (OUP)},
  author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
  year = {2025},
  month = jul
}

License

GPL-3.0. See LICENSE.
