---
language: dna
tags:
  - Biology
  - DNA
  - Splicing
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# SpTransformer

Transformer network for predicting tissue-specific splicing from pre-mRNA sequences.

## Disclaimer

This is an UNOFFICIAL implementation of [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6) by Ningyuan You, Chang Liu, Yuxin Gu, et al.

The OFFICIAL repository of SpliceTransformer (SpTransformer) is at [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing SpTransformer did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

SpTransformer (SpliceTransformer) is a deep neural network that predicts tissue-specific splicing from primary pre-mRNA sequence.
It combines two pretrained SpliceAI-style dilated-residual convolutional feature extractors with a trainable input-projection path; the concatenated features are processed by a Sinkhorn transformer attention block with axial positional embeddings.
For each position the network predicts a 3-channel splice-site score (no-splice / acceptor / donor) and a per-position splice-site usage score across 15 human tissues.
The model uses a fixed flanking context of 4,000 nucleotides on each side of every predicted position.
SpTransformer is typically used to estimate the effect of genetic variants on tissue-specific splicing by scoring reference and alternate sequences and taking the difference.
Please refer to the [Training Details](#training-details) section for more information on the training process.
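
The variant-effect recipe mentioned above (score the reference and the alternate sequence, then take the difference) can be sketched in plain Python. This is only an illustration: the helper names `variant_delta` and `max_abs_delta` are hypothetical, the sign convention (alternate minus reference) and the max-absolute-value summary are assumptions, and the sketch assumes a substitution so that both score tables cover the same positions.

```python
def variant_delta(ref_scores, alt_scores):
    """Per-position, per-channel difference (alternate minus reference).

    Both arguments are lists of rows, one row of channel scores per position.
    Assumption: the variant is a substitution, so positions line up one to one.
    """
    if len(ref_scores) != len(alt_scores):
        raise ValueError("reference and alternate must cover the same positions")
    return [
        [alt - ref for ref, alt in zip(ref_row, alt_row)]
        for ref_row, alt_row in zip(ref_scores, alt_scores)
    ]


def max_abs_delta(delta):
    """Summarise a delta table by its largest absolute change (an assumed summary)."""
    return max(abs(value) for row in delta for value in row)


# Toy scores for two positions and three channels (made-up numbers).
reference = [[1.0, 0.0, 0.0], [0.25, 0.75, 0.0]]
alternate = [[1.0, 0.0, 0.0], [0.75, 0.25, 0.0]]
delta = variant_delta(reference, alternate)
summary = max_abs_delta(delta)
```

In practice the two score tables would come from running the model on the reference and alternate sequences; how the per-position differences are aggregated into a single variant score is a design choice that this card does not specify.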

### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Max Seq Len | Num Parameters (M) | FLOPs (G) | MACs (G) | Context |
| ---------- | ----------- | --------- | ----------------- | ----------- | ------------------ | --------- | -------- | ------- |
| 8          | 256         | 8         | 1024              | 8192        | 17.07              | 290.72    | 144.65   | 4000    |

- Num Layers / Hidden Size / Num Heads / Intermediate Size / Max Seq Len describe the Sinkhorn transformer attention block.
- The two SpliceAI-style feature extractors use hidden sizes 128 and 64; Num Parameters counts the full checkpoint.
- Context is the fixed flanking context (in nucleotides) consumed on each side of every predicted position.
- FLOPs and MACs are measured on a 100-nucleotide input.

### Links

- **Code**: [multimolecule.sptransformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/sptransformer)
- **Weights**: [multimolecule/sptransformer](https://huggingface.co/multimolecule/sptransformer)
- **Paper**: [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6)
- **Developed by**: Ningyuan You, Chang Liu, Yuxin Gu, Rong Wang, Hanying Jia, Tianyun Zhang, Song Jiang, Jinsong Shi, Ming Chen, Min-Xin Guan, Siqi Sun, Shanshan Pei, Zhihong Liu, Ning Shen
- **Original Repository**: [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### RNA Splicing Site Prediction

You can use this model directly to predict per-nucleotide tissue-specific splicing of a pre-mRNA sequence:

```python
>>> from multimolecule import DnaTokenizer, SpTransformerModel

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/sptransformer")
>>> model = SpTransformerModel.from_pretrained("multimolecule/sptransformer")
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['last_hidden_state', 'logits'])
```

The `logits` tensor reproduces the original SpTransformer output: a 3-channel splice-site score (no-splice / acceptor / donor) and a per-tissue (15 tissues) splice-site usage score for each position.
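
A small sketch of how one position's scores might be separated into the two parts described above. It assumes the 3 splice-site channels and the 15 tissue channels are concatenated along the last axis in the order listed (18 values per position); that layout is an assumption and should be checked against the model code. The helper name `split_position_scores` is hypothetical.

```python
NUM_SPLICE_CHANNELS = 3  # no-splice / acceptor / donor, as listed in this card
NUM_TISSUES = 15         # per-tissue splice-site usage scores
SPLICE_NAMES = ("no_splice", "acceptor", "donor")


def split_position_scores(row):
    """Split one position's scores into (splice-site scores, tissue usage scores).

    Assumption: `row` holds 18 values ordered as [3 splice-site | 15 tissue].
    """
    expected = NUM_SPLICE_CHANNELS + NUM_TISSUES
    if len(row) != expected:
        raise ValueError(f"expected {expected} channels, got {len(row)}")
    splice = dict(zip(SPLICE_NAMES, row[:NUM_SPLICE_CHANNELS]))
    tissue_usage = list(row[NUM_SPLICE_CHANNELS:])
    return splice, tissue_usage


# Toy row with made-up values, standing in for one position of `logits`.
toy_row = [0.1, 0.7, 0.2] + [i / 100 for i in range(NUM_TISSUES)]
splice, tissue_usage = split_position_scores(toy_row)
```

Under the same assumption, the equivalent tensor operation would be slicing the last axis, for example `output.logits[..., :3]` and `output.logits[..., 3:]`.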

### Downstream Use

#### Token Prediction

You can fine-tune SpTransformer for per-nucleotide tissue-specific splicing regression with [`SpTransformerForTokenPrediction`][multimolecule.models.SpTransformerForTokenPrediction], which adds a shared token prediction head on top of the backbone.

### Interpretability: Faithful Sparse-Attention Exposure

SpTransformer's attention block does **not** compute dense self-attention. Each layer
([`SpTransformerSelfAttention`][multimolecule.models.sptransformer.modeling_sptransformer.SpTransformerSelfAttention])
splits its heads into two groups with **fundamentally different sparse-attention structures**:

- **Windowed-local heads** — each window of `bucket_size` tokens attends only to itself plus the immediately
  preceding and following window (a `look_backward=1`, `look_forward=1` look-around). Boundary positions are
  masked.
- **Sinkhorn sorted-bucket heads** — each query bucket attends to the concatenation of (a) one _sorted /
  reordered_ key bucket selected by a parameter-free attention-sort net (`differentiable_topk(R, k=1)`) and
  (b) its own local bucket.
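
The look-around pattern of the windowed-local heads can be illustrated by listing, for each window, the absolute positions it may attend to. The sketch below is not the library code: the function name is hypothetical, and the column order `[previous | own | next]` and the use of `-1` for out-of-range columns are assumptions chosen for illustration.

```python
def sketch_local_key_positions(sequence_length, bucket_size, look_backward=1, look_forward=1):
    """Return one row of absolute key positions per window.

    Each row has (look_backward + 1 + look_forward) * bucket_size entries.
    Columns that fall before the first or after the last window are set to -1.
    """
    if sequence_length % bucket_size != 0:
        raise ValueError("sequence_length must be a multiple of bucket_size")
    num_windows = sequence_length // bucket_size
    table = []
    for window in range(num_windows):
        row = []
        for offset in range(-look_backward, look_forward + 1):
            source = window + offset
            if 0 <= source < num_windows:
                start = source * bucket_size
                row.extend(range(start, start + bucket_size))
            else:
                # No window exists on this side, so the columns are padding.
                row.extend([-1] * bucket_size)
        table.append(row)
    return table


positions = sketch_local_key_positions(sequence_length=16, bucket_size=4)
```

With 16 tokens and windows of 4, every window sees 12 key columns; the first window has 4 padded columns on the left and the last window has 4 on the right.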

Because these two patterns operate on different key axes, there is **no single dense `(batch, heads,
sequence, sequence)` tensor that faithfully represents the computation**. Materialising a zero-filled
`sequence x sequence` grid would be a _misleading_ interpretability artifact, so this model does **not**
expose one.

Instead, attention recording is **opt-in** and faithful. Passing `output_attentions=True` (or setting
`config.output_attentions=True`) returns, for every attention layer, a
[`SpTransformerAttentionMap`][multimolecule.models.SpTransformerAttentionMap] holding the _actual_ `softmax`
weights used in the forward pass plus the indexing/permutation needed to map them back to absolute sequence
positions:

- `local_attentions` `(B, num_local_heads, num_windows, W, (look_backward + 1 + look_forward) * W)` — the
  real per-window softmax weights; padded look-around columns carry weight `0`.
- `local_key_positions` `(num_windows, (look_backward + 1 + look_forward) * W)` — absolute source position
  of every local key-axis column (`-1` marks padded columns).
- `sinkhorn_attentions` `(B, num_sinkhorn_heads, num_buckets, W, 2 * W)` — the real per-bucket softmax
  weights over the `[reordered-bucket | own-bucket]` key axis.
- `sinkhorn_reorder` `(B, num_sinkhorn_heads, num_buckets, num_buckets)` — the exact bucket-permutation
  matrix; for query bucket `u`, the nonzero column `v` of row `u` says the reordered key bucket (columns
  `0:W` of `sinkhorn_attentions`) is source bucket `v` (absolute positions `v*W : v*W + W`).
- scalar metadata: `bucket_size`, `look_backward`, `look_forward`, `num_local_heads`,
  `num_sinkhorn_heads`, `sequence_length`.

`W` is `bucket_size`; local heads come first along the head axis, Sinkhorn heads second. These are
**structured block weights, not dense attention matrices** — re-deriving the per-type attention output by
contracting these exact weights with the (block-gathered) values reproduces the layer output exactly.
Recording is opt-in, so the default forward path and its numerics are byte-for-byte unchanged.

```python
>>> import torch
>>> from multimolecule import SpTransformerConfig, SpTransformerModel
>>> config = SpTransformerConfig(bucket_size=4, max_seq_len=16, context=2, num_hidden_layers=2)
>>> model = SpTransformerModel(config).eval()
>>> output = model(torch.randint(config.vocab_size, (1, 16)), output_attentions=True)
>>> layer0 = output.attentions[0]
>>> layer0.local_attentions.shape
torch.Size([1, 2, 4, 4, 12])
>>> layer0.sinkhorn_attentions.shape
torch.Size([1, 6, 4, 4, 8])
>>> layer0.sinkhorn_reorder.shape
torch.Size([1, 6, 4, 4])
```
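
To make the role of `sinkhorn_reorder` concrete, the following sketch decodes a single head's bucket-permutation matrix into absolute key positions for the `[reordered-bucket | own-bucket]` key axis. It works on a plain nested list rather than a tensor, the function name is hypothetical, and it assumes that the own-bucket columns (`W : 2 * W`) map to the query bucket's own positions.

```python
def sketch_sinkhorn_key_positions(reorder, bucket_size):
    """Map each query bucket to the absolute positions of its 2 * W key columns.

    `reorder` is a num_buckets x num_buckets matrix for one batch item and one
    head; row u has a single nonzero entry whose column v is the source bucket.
    """
    positions = []
    for query_bucket, row in enumerate(reorder):
        # Column of the nonzero (largest) entry selects the reordered source bucket.
        source_bucket = max(range(len(row)), key=row.__getitem__)
        reordered = list(range(source_bucket * bucket_size, (source_bucket + 1) * bucket_size))
        own = list(range(query_bucket * bucket_size, (query_bucket + 1) * bucket_size))
        positions.append(reordered + own)
    return positions


# Toy permutation for 4 buckets (made-up values, not taken from a real model).
toy_reorder = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
key_positions = sketch_sinkhorn_key_positions(toy_reorder, bucket_size=4)
```

For a real `SpTransformerAttentionMap`, the same lookup could be done on one `(num_buckets, num_buckets)` slice of `sinkhorn_reorder`, for example by taking `argmax` over its last axis.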

## Training Details

SpTransformer was trained to predict tissue-specific splicing from primary pre-mRNA sequence.

### Training Data

SpTransformer was trained on splicing measurements derived from RNA-seq data across 15 human tissues, using gene annotations from [GENCODE](https://multimolecule.danling.org/datasets/gencode), together with multi-species sequence data.
The two convolutional feature extractors were pre-trained as SpliceAI-style splice-site predictors; MultiMolecule exposes them as trainable submodules for downstream fine-tuning.
For each predicted nucleotide, a sequence window centered on that nucleotide was used, with the flanking context padded with `N` (unknown nucleotide) when near transcript ends.
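
The windowing described above can be sketched as follows. The helper `centered_window` is hypothetical and only illustrates the idea of a window centered on one nucleotide with `N` padding near transcript ends; whether the MultiMolecule model applies such padding internally is not stated in this card.

```python
def centered_window(transcript, position, context=4000, pad="N"):
    """Return the window of length 2 * context + 1 centered on `position`.

    Positions outside the transcript are filled with the pad character.
    """
    if not 0 <= position < len(transcript):
        raise IndexError("position is outside the transcript")
    start = position - context
    end = position + context + 1
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(transcript))
    core = transcript[max(0, start):min(len(transcript), end)]
    return left_pad + core + right_pad


# Tiny example with a 3-nucleotide context instead of 4,000.
near_start = centered_window("ACGTACGT", position=1, context=3)
near_end = centered_window("ACGTACGT", position=7, context=3)
```

The predicted nucleotide always sits at index `context` of the returned window, regardless of how much padding was needed.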

### Training Procedure

#### Pre-training

The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over per-tissue splice-site usage, comparing predictions against measurements derived from RNA-seq.
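
As a rough illustration of such a combined objective for a single position, the sketch below adds a cross-entropy term over the three splice-site classes to a regression term over the 15 tissue usage values. The specific regression loss (mean squared error here) and the weighting factor `usage_weight` are assumptions made for the example; this card does not state which were used in the original training.

```python
import math


def cross_entropy(probabilities, target_index, eps=1e-12):
    """Negative log-probability of the true class (clamped to avoid log(0))."""
    return -math.log(max(probabilities[target_index], eps))


def mean_squared_error(predicted, observed):
    """Average squared difference; an assumed stand-in for the regression loss."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def combined_loss(site_probs, site_label, usage_pred, usage_true, usage_weight=1.0):
    """Classification term plus weighted regression term for one position."""
    classification = cross_entropy(site_probs, site_label)
    regression = mean_squared_error(usage_pred, usage_true)
    return classification + usage_weight * regression


# Toy position: true class is "acceptor" (index 1); usage is predicted perfectly.
loss = combined_loss([0.1, 0.8, 0.1], 1, [0.5] * 15, [0.5] * 15)
```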

## Citation

```bibtex
@article{You2024,
  author    = {You, Ningyuan and Liu, Chang and Gu, Yuxin and Wang, Rong and Jia, Hanying and Zhang, Tianyun and Jiang, Song and Shi, Jinsong and Chen, Ming and Guan, Min-Xin and Sun, Siqi and Pei, Shanshan and Liu, Zhihong and Shen, Ning},
  title     = {{SpliceTransformer predicts tissue-specific splicing linked to human diseases}},
  journal   = {Nature Communications},
  year      = {2024},
  volume    = {15},
  number    = {1},
  pages     = {9129},
  month     = {oct},
  doi       = {10.1038/s41467-024-53088-6},
  issn      = {2041-1723},
  url       = {https://doi.org/10.1038/s41467-024-53088-6}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [SpliceTransformer paper](https://doi.org/10.1038/s41467-024-53088-6) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```