---
language: dna
tags:
  - Biology
  - DNA
  - Splicing
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# SpTransformer

Transformer network for predicting tissue-specific splicing from pre-mRNA sequences.

## Disclaimer

This is an UNOFFICIAL implementation of [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6) by Ningyuan You, Chang Liu, Yuxin Gu, et al. and Ning Shen.

The OFFICIAL repository of SpliceTransformer (SpTransformer) is at [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing SpTransformer did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

SpTransformer (SpliceTransformer) is a deep neural network that predicts tissue-specific splicing from primary pre-mRNA sequence.
It combines two pretrained SpliceAI-style dilated-residual convolutional feature extractors with a trainable input-projection path; the concatenated features are processed by a Sinkhorn transformer attention block with axial positional embeddings.
For each position the network predicts a 3-channel splice-site score (no-splice / acceptor / donor) and a per-position splice-site usage score across 15 human tissues.
The model uses a fixed flanking context of 4,000 nucleotides on each side of every predicted position.
SpTransformer is typically used to estimate the effect of genetic variants on tissue-specific splicing by scoring reference and alternate sequences and taking the difference.
Please refer to the [Training Details](#training-details) section for more information on the training process.
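
The variant-effect scoring mentioned above (score the reference and alternate sequences, then take the difference) can be sketched in plain Python. This is a minimal illustration, not the original pipeline: the helper names, the `alt - ref` sign convention, and the max-absolute-change summary are assumptions, and the score values are made up.

```python
from typing import List


def delta_scores(ref: List[List[float]], alt: List[List[float]]) -> List[List[float]]:
    """Per-position, per-channel difference (alt - ref) between two score tracks."""
    if len(ref) != len(alt):
        raise ValueError("ref and alt must cover the same number of positions")
    return [
        [a - r for r, a in zip(ref_row, alt_row)]
        for ref_row, alt_row in zip(ref, alt)
    ]


def max_abs_delta(deltas: List[List[float]]) -> float:
    """Summarise a variant by its largest absolute score change (an assumed summary)."""
    return max(abs(d) for row in deltas for d in row)


# Hypothetical scores: 3 positions x 2 tissues, illustrative numbers only.
ref_scores = [[0.10, 0.20], [0.80, 0.70], [0.05, 0.05]]
alt_scores = [[0.10, 0.25], [0.30, 0.60], [0.05, 0.05]]

deltas = delta_scores(ref_scores, alt_scores)
summary = max_abs_delta(deltas)
```

The sketch assumes a substitution, so both tracks have the same length; insertions and deletions would need the two tracks to be aligned first.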

### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Max Seq Len | Num Parameters (M) | FLOPs (G) | MACs (G) | Context |
| ---------- | ----------- | --------- | ----------------- | ----------- | ------------------ | --------- | -------- | ------- |
| 8          | 256         | 8         | 1024              | 8192        | 17.07              | 290.72    | 144.65   | 4000    |

- Num Layers / Hidden Size / Num Heads / Intermediate Size / Max Seq Len describe the Sinkhorn transformer attention block.
- The two SpliceAI-style feature extractors use hidden sizes 128 and 64; Num Parameters counts the full checkpoint.
- Context is the fixed flanking context (in nucleotides) consumed on each side of every predicted position.
- FLOPs and MACs are measured on a 100-nucleotide input.

### Links

- **Code**: [multimolecule.sptransformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/sptransformer)
- **Weights**: [multimolecule/sptransformer](https://huggingface.co/multimolecule/sptransformer)
- **Paper**: [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6)
- **Developed by**: Ningyuan You, Chang Liu, Yuxin Gu, Rong Wang, Hanying Jia, Tianyun Zhang, Song Jiang, Jinsong Shi, Ming Chen, Min-Xin Guan, Siqi Sun, Shanshan Pei, Zhihong Liu, Ning Shen
- **Original Repository**: [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### RNA Splicing Site Prediction

You can use this model directly to predict per-nucleotide tissue-specific splicing of a pre-mRNA sequence:

```python
>>> from multimolecule import DnaTokenizer, SpTransformerModel

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/sptransformer")
>>> model = SpTransformerModel.from_pretrained("multimolecule/sptransformer")
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['last_hidden_state', 'logits'])
```

The `logits` tensor reproduces the original SpTransformer output: a 3-channel splice-site score (no-splice / acceptor / donor) and a per-tissue (15 tissues) splice-site usage score for each position.
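
As a rough illustration of that layout, the sketch below splits one position's score vector into its splice-site and tissue parts. It assumes the channel axis has 18 entries with the 3 splice-site channels first and the 15 tissue channels after them; that ordering is an assumption and should be checked against the model configuration. `split_channels` is a hypothetical helper, not part of `multimolecule`.

```python
NUM_SPLICE_CHANNELS = 3  # no-splice / acceptor / donor
NUM_TISSUES = 15


def split_channels(position_scores):
    """Split one position's scores into (splice_site, tissue_usage).

    Assumes splice-site channels come first; this ordering is not confirmed here.
    """
    expected = NUM_SPLICE_CHANNELS + NUM_TISSUES
    if len(position_scores) != expected:
        raise ValueError(f"expected {expected} channels, got {len(position_scores)}")
    return position_scores[:NUM_SPLICE_CHANNELS], position_scores[NUM_SPLICE_CHANNELS:]


# Placeholder values standing in for one position of `logits`.
example = [float(i) for i in range(18)]
splice_site, tissue_usage = split_channels(example)
```

Under the same assumption, the equivalent tensor slices would be `logits[..., :3]` and `logits[..., 3:]`.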

### Downstream Use

#### Token Prediction

You can fine-tune SpTransformer for per-nucleotide tissue-specific splicing regression with [`SpTransformerForTokenPrediction`][multimolecule.models.SpTransformerForTokenPrediction], which adds a shared token prediction head on top of the backbone.

### Interpretability: Faithful Sparse-Attention Exposure

SpTransformer's attention block does **not** compute dense self-attention. Each layer
([`SpTransformerSelfAttention`][multimolecule.models.sptransformer.modeling_sptransformer.SpTransformerSelfAttention])
splits its heads into two groups with **fundamentally different sparse-attention structures**:

- **Windowed-local heads** — each window of `bucket_size` tokens attends only to itself plus the immediately
  preceding and following window (a `look_backward=1`, `look_forward=1` look-around). Boundary positions are
  masked.
- **Sinkhorn sorted-bucket heads** — each query bucket attends to the concatenation of (a) one _sorted /
  reordered_ key bucket selected by a parameter-free attention-sort net (`differentiable_topk(R, k=1)`) and
  (b) its own local bucket.

Because these two patterns operate on different key axes, there is **no single dense `(batch, heads,
sequence, sequence)` tensor that faithfully represents the computation**. Materialising a zero-filled
`sequence x sequence` grid would be a _misleading_ interpretability artifact, so this model does **not**
expose one.
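
To make the windowed-local pattern concrete, the following sketch lists, for each window, the absolute positions its queries can attend to. It uses `-1` for columns that fall outside the sequence, the same marker `local_key_positions` uses. The function is a hypothetical re-derivation for illustration, and the column order `[preceding | own | following]` is an assumption rather than something read from the model code.

```python
def local_key_positions(sequence_length, bucket_size, look_backward=1, look_forward=1):
    """Absolute key positions visible to each window; -1 marks padded columns."""
    if sequence_length % bucket_size != 0:
        raise ValueError("sequence_length must be a multiple of bucket_size")
    num_windows = sequence_length // bucket_size
    rows = []
    for window in range(num_windows):
        row = []
        # Walk the preceding window(s), the window itself, then the following window(s).
        for offset in range(-look_backward, look_forward + 1):
            source = window + offset
            if 0 <= source < num_windows:
                row.extend(range(source * bucket_size, (source + 1) * bucket_size))
            else:
                row.extend([-1] * bucket_size)  # look-around falls off the sequence
        rows.append(row)
    return rows


positions = local_key_positions(sequence_length=16, bucket_size=4)
# positions[0] starts with four -1 entries because the first window has no predecessor.
```

Each row has `(look_backward + 1 + look_forward) * bucket_size` columns, so the key axis never grows with sequence length.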

Instead, attention recording is **opt-in** and faithful. Passing `output_attentions=True` (or setting
`config.output_attentions=True`) returns, for every attention layer, a
[`SpTransformerAttentionMap`][multimolecule.models.SpTransformerAttentionMap] holding the _actual_ `softmax`
weights used in the forward pass plus the indexing/permutation needed to map them back to absolute sequence
positions:

- `local_attentions` `(B, num_local_heads, num_windows, W, (look_backward + 1 + look_forward) * W)` — the
  real per-window softmax weights; padded look-around columns carry weight `0`.
- `local_key_positions` `(num_windows, (look_backward + 1 + look_forward) * W)` — absolute source position
  of every local key-axis column (`-1` marks padded columns).
- `sinkhorn_attentions` `(B, num_sinkhorn_heads, num_buckets, W, 2 * W)` — the real per-bucket softmax
  weights over the `[reordered-bucket | own-bucket]` key axis.
- `sinkhorn_reorder` `(B, num_sinkhorn_heads, num_buckets, num_buckets)` — the exact bucket-permutation
  matrix; for query bucket `u`, the nonzero column `v` of row `u` says the reordered key bucket (columns
  `0:W` of `sinkhorn_attentions`) is source bucket `v` (absolute positions `v*W : v*W + W`).
- scalar metadata: `bucket_size`, `look_backward`, `look_forward`, `num_local_heads`,
  `num_sinkhorn_heads`, `sequence_length`.

`W` is `bucket_size`; local heads come first along the head axis, Sinkhorn heads second. These are
**structured block weights, not dense attention matrices** — re-deriving the per-type attention output by
contracting these exact weights with the (block-gathered) values reproduces the layer output exactly.
Recording is opt-in, so the default forward path and its numerics are byte-for-byte unchanged.

```python
>>> import torch
>>> from multimolecule import SpTransformerConfig, SpTransformerModel
>>> config = SpTransformerConfig(bucket_size=4, max_seq_len=16, context=2, num_hidden_layers=2)
>>> model = SpTransformerModel(config).eval()
>>> output = model(torch.randint(config.vocab_size, (1, 16)), output_attentions=True)
>>> layer0 = output.attentions[0]
>>> layer0.local_attentions.shape
torch.Size([1, 2, 4, 4, 12])
>>> layer0.sinkhorn_attentions.shape
torch.Size([1, 6, 4, 4, 8])
>>> layer0.sinkhorn_reorder.shape
torch.Size([1, 6, 4, 4])
```
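
The mapping from `sinkhorn_reorder` back to absolute positions can be sketched as follows for a single `(batch, head)` slice. `reordered_source_positions` is a hypothetical helper and the matrix below is made up for illustration; the sketch only relies on each row having exactly one nonzero entry, as described above.

```python
def reordered_source_positions(reorder, bucket_size):
    """Map each query bucket u to the absolute positions of its reordered key bucket."""
    sources = {}
    for u, row in enumerate(reorder):
        nonzero = [v for v, weight in enumerate(row) if weight != 0]
        if len(nonzero) != 1:
            raise ValueError(f"row {u} does not select exactly one bucket")
        v = nonzero[0]
        # Source bucket v covers absolute positions v*W : v*W + W.
        sources[u] = list(range(v * bucket_size, v * bucket_size + bucket_size))
    return sources


# Illustrative one-hot rows for 4 buckets (not taken from a real forward pass).
reorder = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
sources = reordered_source_positions(reorder, bucket_size=4)
```

For a recorded layer, a slice such as `layer0.sinkhorn_reorder[0, h].tolist()` would supply the matrix for Sinkhorn head `h`.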

## Training Details

SpTransformer was trained to predict tissue-specific splicing from primary pre-mRNA sequence.

### Training Data

SpTransformer was trained on splicing measurements derived from RNA-seq data across 15 human tissues, using gene annotations from [GENCODE](https://multimolecule.danling.org/datasets/gencode), together with multi-species sequence data.
The two convolutional feature extractors were pre-trained as SpliceAI-style splice-site predictors; MultiMolecule exposes them as trainable submodules for downstream fine-tuning.
For each predicted nucleotide, a sequence window centered on that nucleotide was used, with the flanking context padded with `N` (unknown nucleotide) when near transcript ends.
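
The windowing described above can be sketched as below. This is a minimal illustration under the assumption that the window is symmetric around the predicted nucleotide; `context_window` is a hypothetical helper, and a small `context` is used here in place of the model's 4,000 so the result is easy to read.

```python
def context_window(sequence, position, context):
    """Return the window of length 2 * context + 1 centered on `position`.

    Positions that fall outside the sequence are filled with 'N'.
    """
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    padded = "N" * context + sequence + "N" * context
    # After padding, original index `position` sits at `position + context`,
    # so the window starts at index `position` of the padded string.
    return padded[position : position + 2 * context + 1]


window = context_window("ACGTACGT", position=1, context=3)
```

Here the window is centered on the `C` at index 1, and the two leading `N` characters stand in for the sequence that would lie before the transcript start.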

### Training Procedure

#### Pre-training

The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over per-tissue splice-site usage, comparing predictions against measurements derived from RNA-seq.
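
A combined objective of this shape can be sketched for a single position as follows. This is illustrative only: the card does not state which regression loss or which weighting was used, so the mean squared error and the `usage_weight` parameter below are assumptions, and all function names are hypothetical.

```python
import math


def cross_entropy(class_probs, target_index):
    """Negative log-probability assigned to the true splice-site class."""
    return -math.log(class_probs[target_index])


def mean_squared_error(predicted, target):
    """Assumed regression loss over per-tissue usage (not confirmed by the source)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def combined_loss(class_probs, class_target, usage_pred, usage_target, usage_weight=1.0):
    """Classification loss plus a weighted regression loss; the weight is an assumption."""
    return cross_entropy(class_probs, class_target) + usage_weight * mean_squared_error(
        usage_pred, usage_target
    )


# Illustrative values: 3 class probabilities and usage for 2 tissues.
loss = combined_loss([0.1, 0.8, 0.1], 1, [0.5, 0.2], [0.7, 0.2])
```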

## Citation

```bibtex
@article{You2024,
  author    = {You, Ningyuan and Liu, Chang and Gu, Yuxin and Wang, Rong and Jia, Hanying and Zhang, Tianyun and Jiang, Song and Shi, Jinsong and Chen, Ming and Guan, Min-Xin and Sun, Siqi and Pei, Shanshan and Liu, Zhihong and Shen, Ning},
  title     = {{SpliceTransformer predicts tissue-specific splicing linked to human diseases}},
  journal   = {Nature Communications},
  year      = {2024},
  volume    = {15},
  number    = {1},
  pages     = {9129},
  month     = {oct},
  doi       = {10.1038/s41467-024-53088-6},
  issn      = {2041-1723},
  url       = {https://doi.org/10.1038/s41467-024-53088-6}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [SpliceTransformer paper](https://doi.org/10.1038/s41467-024-53088-6) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```