---
language: en
license: gpl-3.0
tags:
  - immunoinformatics
  - antibody
  - TCR
  - AIRR
  - sequence-alignment
  - bioinformatics
  - pytorch
library_name: alignair
pipeline_tag: token-classification
---

# AlignAIR Pretrained Models

**AlignAIR** is a deep learning tool for aligning immunoglobulin (IG) and T-cell receptor (TCR) sequences to germline gene databases. It simultaneously predicts V/D/J gene assignments, segment boundaries, mutation rates, and productivity, all in a single forward pass.

## Available Models

| Model | Chain | Germline DB | V Alleles | D Alleles | J Alleles | Size |
|-------|-------|-------------|-----------|-----------|-----------|------|
| `HUMAN_IGH_OGRDB_576` | IGH (Heavy) | OGRDB | 198 | 33 | 7 | 17 MB |
| `HUMAN_IGH_EXTENDED_576` | IGH (Heavy) | Extended | 342 | 37 | 10 | 28 MB |
| `HUMAN_IGK_OGRDB_576` | IGK (Kappa) | OGRDB | 168 | N/A | 8 | 12 MB |
| `HUMAN_IGL_OGRDB_576` | IGL (Lambda) | OGRDB | 181 | N/A | 10 | 13 MB |
| `HUMAN_TCRB_IMGT_576` | TCRB (Beta) | IMGT | 130 | 3 | 14 | 12 MB |

All models accept sequences of up to 576 nucleotides and were trained for 1000 epochs on synthetic data generated by [GenAIRR](https://github.com/MuteJester/GenAIRR).

## Quick Start

```bash
pip install "alignair[hub]"  # quotes keep shells such as zsh from expanding the brackets
```

### Python API

```python
from AlignAIR.Models import SingleChainAlignAIR
from AlignAIR.Hub import get_model_path

# Download and load a model (cached automatically)
model_path = get_model_path("igh")  # or "HUMAN_IGH_OGRDB_576"
model = SingleChainAlignAIR.from_pretrained(model_path)
```

### CLI

```bash
# Run inference with a pretrained model
alignair --model-dir HUMAN_IGH_OGRDB_576 input_sequences.csv -o results/
```

## Benchmark Results (100K synthetic sequences)

| Model | AlignAIR V | IgBLAST V | AlignAIR D | IgBLAST D | AlignAIR J | IgBLAST J | AlignAIR Speed |
|-------|-----------|-----------|-----------|-----------|-----------|-----------|----------------|
| IGH OGRDB | 94.1% | 95.5% | 81.7% | 69.8% | 99.3% | 99.5% | 4,272 seq/s |
| IGH Extended | 92.3% | 93.9% | 88.5% | 82.6% | 98.7% | 98.4% | 4,245 seq/s |
| IGK OGRDB | 94.6% | 95.4% | N/A | N/A | 97.2% | 96.0% | 4,807 seq/s |
| IGL OGRDB | 93.9% | 95.3% | N/A | N/A | 98.4% | 96.7% | 5,384 seq/s |
| TCRB IMGT | 96.5% | 96.2% | 89.6% | 76.3% | 99.6% | 99.1% | 4,317 seq/s |

AlignAIR speed was measured on an NVIDIA RTX 3090 Ti GPU; IgBLAST 1.22.0 was run with 8 CPU threads.
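
To put the throughput column in perspective, the short sketch below converts each reported rate into an approximate wall-clock time for the 100K-sequence benchmark set. The rates are copied from the table; the calculation assumes throughput stays constant over the run and ignores model loading and file I/O, neither of which is described here. The function name `seconds_for` is illustrative only.

```python
# Throughput values (sequences per second) copied from the benchmark table.
SPEEDS = {
    "IGH OGRDB": 4272,
    "IGH Extended": 4245,
    "IGK OGRDB": 4807,
    "IGL OGRDB": 5384,
    "TCRB IMGT": 4317,
}

N_SEQUENCES = 100_000  # size of the synthetic benchmark set


def seconds_for(n_sequences, seqs_per_second):
    """Estimated inference time, assuming a constant throughput."""
    return n_sequences / seqs_per_second


# Estimated seconds per model for the full benchmark set.
estimates = {name: seconds_for(N_SEQUENCES, speed) for name, speed in SPEEDS.items()}

for name, seconds in estimates.items():
    print(f"{name}: {seconds:.1f} s")
```

Under these assumptions every model processes the full benchmark set in roughly 19 to 24 seconds.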

## Model Architecture

Each model is a `SingleChainAlignAIR` module combining:
- **Nucleotide embedding** (5→64 dim) with center-padded tokenization
- **Spatial segmentation** via 9-layer dilated convolutions (receptive field = 1023 nt)
- **Conditioned boundary heads** with chain decoding (v_start → v_end → d_start → ...)
- **Classification heads** for V/D/J allele assignment
- **Analysis heads** for mutation rate and productivity prediction
- **In-model orientation correction** (4-class: forward, reverse-complement, complement, reverse)
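
Three of the bullets above can be made concrete with a small standalone sketch. It does not use AlignAIR's code and makes several assumptions that the list does not state: kernel size 3 with dilations doubling per layer (one configuration that reproduces the quoted 1023 nt receptive field), a vocabulary of one padding token plus A/C/G/T for the 5-symbol embedding input, and an even split of padding with any odd remainder placed on the right. All function names are hypothetical.

```python
def receptive_field(num_layers=9, kernel_size=3):
    """Receptive field of stacked dilated convolutions.

    Assumes layer i uses dilation 2**i (1, 2, 4, ..., 256 for 9 layers).
    """
    dilations = [2 ** i for i in range(num_layers)]
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical vocabulary: 0 is padding, 1-4 are the nucleotides.
TOKEN_IDS = {"A": 1, "C": 2, "G": 3, "T": 4}
PAD_ID = 0


def center_pad(seq, max_len=576):
    """Tokenize seq and pad it on both sides so it sits in the middle."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len} nt")
    total = max_len - len(seq)
    left = total // 2          # any odd leftover goes to the right side
    right = total - left
    tokens = [TOKEN_IDS[base] for base in seq.upper()]
    return [PAD_ID] * left + tokens + [PAD_ID] * right


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(seq):
    """The four orientation classes named in the list, applied to seq."""
    comp = seq.translate(COMPLEMENT)
    return {
        "forward": seq,
        "reverse-complement": comp[::-1],
        "complement": comp,
        "reverse": seq[::-1],
    }


print(receptive_field())          # 1 + 2 * (1 + 2 + ... + 256)
print(center_pad("ACGT", max_len=8))
print(orientations("AACG"))
```

With these assumptions the receptive field works out to 1 + 2 × 511 = 1023, so a single output position can in principle see the whole 576 nt input wherever the read sits inside the padded window.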

## Bundle Format

Each model directory contains:
- `model.pt`: PyTorch state dict
- `config.json`: Architecture hyperparameters
- `dataconfig.pkl`: Germline allele database (GenAIRR DataConfig)
- `training_meta.json`: Training provenance
- `VERSION`: Bundle format version
- `fingerprint.txt`: SHA-256 integrity hash
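
As a quick sanity check after downloading, the layout above can be verified with the standard library alone. The sketch below only tests that the six listed files are present; it does not validate `fingerprint.txt`, because this card does not say which files the hash covers or how it is computed. The helper name `missing_bundle_files` is hypothetical and not part of AlignAIR.

```python
from pathlib import Path

# File names taken from the bundle description above.
EXPECTED_FILES = (
    "model.pt",
    "config.json",
    "dataconfig.pkl",
    "training_meta.json",
    "VERSION",
    "fingerprint.txt",
)


def missing_bundle_files(bundle_dir):
    """Return the expected bundle files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_bundle_files(path_to_bundle)` on a complete model directory should return an empty list; any names it does return point to an incomplete download.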

## Citation

If you use AlignAIR in your research, please cite:

> Konstantinovsky, T., Peres, A., Eisenberg, R., Polak, P., Lindenbaum, O., & Yaari, G. (2025). Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning. *Nucleic Acids Research*, 53(13). https://doi.org/10.1093/nar/gkaf651

```bibtex
@article{Konstantinovsky2025,
  title = {Enhancing sequence alignment of adaptive immune receptors through multi-task deep learning},
  volume = {53},
  ISSN = {1362-4962},
  url = {http://dx.doi.org/10.1093/nar/gkaf651},
  DOI = {10.1093/nar/gkaf651},
  number = {13},
  journal = {Nucleic Acids Research},
  publisher = {Oxford University Press (OUP)},
  author = {Konstantinovsky, Thomas and Peres, Ayelet and Eisenberg, Ran and Polak, Pazit and Lindenbaum, Ofir and Yaari, Gur},
  year = {2025},
  month = jul
}
```

## License

GPL-3.0. See [LICENSE](https://github.com/MuteJester/AlignAIR/blob/main/LICENSE).

## Links

- [GitHub Repository](https://github.com/MuteJester/AlignAIR)
- [Documentation](https://mutejester.github.io/AlignAIR/)
- [PyPI Package](https://pypi.org/project/AlignAIR/)