---
language: dna
tags:
  - Biology
  - DNA
  - Splicing
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# SpTransformer

Transformer network for predicting tissue-specific splicing from pre-mRNA sequences.

## Disclaimer

This is an UNOFFICIAL implementation of [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6) by Ningyuan You, Chang Liu, Yuxin Gu, et al. and Ning Shen.

The OFFICIAL repository of SpliceTransformer (SpTransformer) is at [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing SpTransformer did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

SpTransformer (SpliceTransformer) is a deep neural network that predicts tissue-specific splicing from primary pre-mRNA sequence. It combines two pretrained SpliceAI-style dilated-residual convolutional feature extractors with a trainable input-projection path; the concatenated features are processed by a Sinkhorn transformer attention block with axial positional embeddings. For each position the network predicts a 3-channel splice-site score (no-splice / acceptor / donor) and a per-position splice-site usage score across 15 human tissues.

The model uses a fixed flanking context of 4,000 nucleotides on each side of every predicted position. SpTransformer is typically used to estimate the effect of genetic variants on tissue-specific splicing by scoring reference and alternate sequences and taking the difference.

Please refer to the [Training Details](#training-details) section for more information on the training process.

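
The variant-scoring recipe described above (score the reference and alternate sequences, then take the difference) can be sketched without any model at all. This is a minimal illustration only: `apply_snv`, `tissue_delta`, and the toy score rows are hypothetical names and made-up numbers, not part of `multimolecule` or the original repository. Reducing the per-position differences to one value per tissue by taking the change of largest magnitude is also an assumption, not something this card specifies.

```python
def apply_snv(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return `sequence` with a single-nucleotide variant applied (0-based position)."""
    if sequence[position] != ref:
        raise ValueError(f"expected {ref!r} at {position}, found {sequence[position]!r}")
    return sequence[:position] + alt + sequence[position + 1:]


def tissue_delta(ref_scores, alt_scores):
    """Per-tissue signed change (alt - ref) with the largest magnitude over positions.

    Both inputs are lists of per-position rows holding one score per tissue.
    The aggregation rule is an illustrative assumption.
    """
    num_tissues = len(ref_scores[0])
    deltas = []
    for tissue in range(num_tissues):
        diffs = [
            alt_row[tissue] - ref_row[tissue]
            for ref_row, alt_row in zip(ref_scores, alt_scores)
        ]
        deltas.append(max(diffs, key=abs))
    return deltas


# Toy example: 3 positions x 2 tissues of made-up usage scores.
reference = "AGCAGTCA"
alternate = apply_snv(reference, position=4, ref="G", alt="A")
ref_scores = [[0.10, 0.20], [0.80, 0.70], [0.05, 0.05]]
alt_scores = [[0.10, 0.25], [0.30, 0.60], [0.05, 0.05]]
delta = tissue_delta(ref_scores, alt_scores)
print(alternate, delta)
```

In practice the two score tables would come from running the model on the reference and alternate sequences; a negative value indicates lower predicted usage for the alternate allele in that tissue.
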
### Model Specification

| Num Layers | Hidden Size | Num Heads | Intermediate Size | Max Seq Len | Num Parameters (M) | FLOPs (G) | MACs (G) | Context |
| ---------- | ----------- | --------- | ----------------- | ----------- | ------------------ | --------- | -------- | ------- |
| 8          | 256         | 8         | 1024              | 8192        | 17.07              | 290.72    | 144.65   | 4000    |

- Num Layers / Hidden Size / Num Heads / Intermediate Size / Max Seq Len describe the Sinkhorn transformer attention block.
- The two SpliceAI-style feature extractors use hidden sizes 128 and 64; Num Parameters counts the full checkpoint.
- Context is the fixed flanking context (in nucleotides) consumed on each side of every predicted position.
- FLOPs and MACs are measured on a 100-nucleotide input.

### Links

- **Code**: [multimolecule.sptransformer](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/sptransformer)
- **Weights**: [multimolecule/sptransformer](https://huggingface.co/multimolecule/sptransformer)
- **Paper**: [SpliceTransformer predicts tissue-specific splicing linked to human diseases](https://doi.org/10.1038/s41467-024-53088-6)
- **Developed by**: Ningyuan You, Chang Liu, Yuxin Gu, Rong Wang, Hanying Jia, Tianyun Zhang, Song Jiang, Jinsong Shi, Ming Chen, Min-Xin Guan, Siqi Sun, Shanshan Pei, Zhihong Liu, Ning Shen
- **Original Repository**: [ShenLab-Genomics/SpliceTransformer](https://github.com/ShenLab-Genomics/SpliceTransformer)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library.
You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### RNA Splicing Site Prediction

You can use this model directly to predict per-nucleotide tissue-specific splicing of a pre-mRNA sequence:

```python
>>> from multimolecule import DnaTokenizer, SpTransformerModel

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/sptransformer")
>>> model = SpTransformerModel.from_pretrained("multimolecule/sptransformer")
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['last_hidden_state', 'logits'])
```

The `logits` tensor reproduces the original SpTransformer output: a 3-channel splice-site score (no-splice / acceptor / donor) and a per-tissue (15 tissues) splice-site usage score for each position.

### Downstream Use

#### Token Prediction

You can fine-tune SpTransformer for per-nucleotide tissue-specific splicing regression with [`SpTransformerForTokenPrediction`][multimolecule.models.SpTransformerForTokenPrediction], which adds a shared token prediction head on top of the backbone.

### Interpretability: Faithful Sparse-Attention Exposure

SpTransformer's attention block does **not** compute dense self-attention. Each layer ([`SpTransformerSelfAttention`][multimolecule.models.sptransformer.modeling_sptransformer.SpTransformerSelfAttention]) splits its heads into two groups with **fundamentally different sparse-attention structures**:

- **Windowed-local heads**: each window of `bucket_size` tokens attends only to itself plus the immediately preceding and following window (a `look_backward=1`, `look_forward=1` look-around). Boundary positions are masked.
- **Sinkhorn sorted-bucket heads**: each query bucket attends to the concatenation of (a) one _sorted / reordered_ key bucket selected by a parameter-free attention-sort net (`differentiable_topk(R, k=1)`) and (b) its own local bucket.

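
The look-around pattern of the windowed-local heads can be made concrete with a small, library-independent sketch that lists which absolute positions each window may attend to. The function name `local_key_positions` and the column ordering (previous window, own window, next window) are assumptions made for illustration; this does not reproduce the model's internal code.

```python
def local_key_positions(sequence_length, bucket_size, look_backward=1, look_forward=1):
    """Absolute key position for every key-axis column of each window.

    Columns are assumed to be ordered [previous windows | own window | next windows].
    -1 marks columns that fall outside the sequence (masked padding).
    """
    if sequence_length % bucket_size != 0:
        raise ValueError("sequence_length must be a multiple of bucket_size")
    num_windows = sequence_length // bucket_size
    table = []
    for window in range(num_windows):
        row = []
        for offset in range(-look_backward, look_forward + 1):
            source = window + offset
            inside = 0 <= source < num_windows
            for i in range(bucket_size):
                row.append(source * bucket_size + i if inside else -1)
        table.append(row)
    return table


# 16 tokens in windows of 4: every window sees at most 3 windows (12 columns).
positions = local_key_positions(sequence_length=16, bucket_size=4)
for window, row in enumerate(positions):
    print(window, row)
```

The first and last windows have no neighbour on one side, so a third of their columns are `-1`; these correspond to the masked boundary positions mentioned above.
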

Because these two patterns operate on different key axes, there is **no single dense `(batch, heads, sequence, sequence)` tensor that faithfully represents the computation**. Materialising a zero-filled `sequence x sequence` grid would be a _misleading_ interpretability artifact, so this model does **not** expose one.

Instead, attention recording is **opt-in** and faithful. Passing `output_attentions=True` (or setting `config.output_attentions=True`) returns, for every attention layer, a [`SpTransformerAttentionMap`][multimolecule.models.SpTransformerAttentionMap] holding the _actual_ `softmax` weights used in the forward pass plus the indexing/permutation needed to map them back to absolute sequence positions:

- `local_attentions` `(B, num_local_heads, num_windows, W, (look_backward + 1 + look_forward) * W)`: the real per-window softmax weights; padded look-around columns carry weight `0`.
- `local_key_positions` `(num_windows, (look_backward + 1 + look_forward) * W)`: absolute source position of every local key-axis column (`-1` marks padded columns).
- `sinkhorn_attentions` `(B, num_sinkhorn_heads, num_buckets, W, 2 * W)`: the real per-bucket softmax weights over the `[reordered-bucket | own-bucket]` key axis.
- `sinkhorn_reorder` `(B, num_sinkhorn_heads, num_buckets, num_buckets)`: the exact bucket-permutation matrix; for query bucket `u`, the nonzero column `v` of row `u` says the reordered key bucket (columns `0:W` of `sinkhorn_attentions`) is source bucket `v` (absolute positions `v*W : v*W + W`).
- scalar metadata: `bucket_size`, `look_backward`, `look_forward`, `num_local_heads`, `num_sinkhorn_heads`, `sequence_length`.

`W` is `bucket_size`; local heads come first along the head axis, Sinkhorn heads second. These are **structured block weights, not dense attention matrices**: re-deriving the per-type attention output by contracting these exact weights with the (block-gathered) values reproduces the layer output exactly.
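
The mapping from `sinkhorn_reorder` back to absolute positions, as described in the list above, can be sketched in plain Python. The helper name `sinkhorn_key_positions` and the toy matrix are hypothetical; the sketch only assumes what the description states, namely that row `u` has a single nonzero column `v` and that the key axis is laid out as `[reordered-bucket | own-bucket]`.

```python
def sinkhorn_key_positions(reorder, bucket_size):
    """Map each query bucket's 2 * W key columns to absolute sequence positions.

    `reorder` is one head's (num_buckets x num_buckets) matrix as nested lists;
    row u is expected to have exactly one nonzero entry, at column v.
    """
    table = []
    for u, row in enumerate(reorder):
        nonzero = [v for v, weight in enumerate(row) if weight != 0]
        if len(nonzero) != 1:
            raise ValueError(f"row {u} does not select exactly one bucket")
        v = nonzero[0]
        reordered = list(range(v * bucket_size, (v + 1) * bucket_size))  # columns 0:W
        own = list(range(u * bucket_size, (u + 1) * bucket_size))        # columns W:2W
        table.append(reordered + own)
    return table


# Toy matrix for 4 buckets (made-up values, not model output).
toy_reorder = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
keys = sinkhorn_key_positions(toy_reorder, bucket_size=4)
for bucket, row in enumerate(keys):
    print(bucket, row)
```

With a recorded attention map, one head's matrix could presumably be passed in after converting the corresponding slice of `sinkhorn_reorder` to nested lists (for example with `.tolist()`); the resulting table then labels the last axis of `sinkhorn_attentions` for that head.
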
Recording is opt-in, so the default forward path and its numerics are byte-for-byte unchanged.

```python
>>> import torch
>>> from multimolecule import SpTransformerConfig, SpTransformerModel

>>> config = SpTransformerConfig(bucket_size=4, max_seq_len=16, context=2, num_hidden_layers=2)
>>> model = SpTransformerModel(config).eval()
>>> output = model(torch.randint(config.vocab_size, (1, 16)), output_attentions=True)
>>> layer0 = output.attentions[0]
>>> layer0.local_attentions.shape
torch.Size([1, 2, 4, 4, 12])
>>> layer0.sinkhorn_attentions.shape
torch.Size([1, 6, 4, 4, 8])
>>> layer0.sinkhorn_reorder.shape
torch.Size([1, 6, 4, 4])
```

## Training Details

SpTransformer was trained to predict tissue-specific splicing from primary pre-mRNA sequence.

### Training Data

SpTransformer was trained on splicing measurements derived from RNA-seq data across 15 human tissues, using gene annotations from [GENCODE](https://multimolecule.danling.org/datasets/gencode), together with multi-species sequence data. The two convolutional feature extractors were pre-trained as SpliceAI-style splice-site predictors; MultiMolecule exposes them as trainable submodules for downstream fine-tuning.

For each predicted nucleotide, a sequence window centered on that nucleotide was used, with the flanking context padded with `N` (unknown nucleotide) when near transcript ends.

### Training Procedure

#### Pre-training

The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over per-tissue splice-site usage, comparing predictions against measurements derived from RNA-seq.

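
The shape of such a combined objective can be sketched in plain Python. This card does not state which regression loss or weighting was used, so the mean squared error term, the `usage_weight` parameter, and the function names below are placeholder assumptions for illustration, not the authors' training code. Class indices follow the channel order given earlier (0 = no-splice, 1 = acceptor, 2 = donor).

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one position; `target` is a class index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def combined_loss(site_logits, site_labels, usage_pred, usage_true, usage_weight=1.0):
    """Mean classification loss plus a weighted mean squared usage error.

    MSE and `usage_weight` are assumptions; the source only says "a regression loss".
    """
    ce = sum(
        cross_entropy(logits, label) for logits, label in zip(site_logits, site_labels)
    ) / len(site_labels)
    squared = [
        (p - t) ** 2
        for pred_row, true_row in zip(usage_pred, usage_true)
        for p, t in zip(pred_row, true_row)
    ]
    mse = sum(squared) / len(squared)
    return ce + usage_weight * mse


# Toy batch: 2 positions, 3 splice-site classes, 2 tissues (made-up numbers).
site_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
site_labels = [0, 1]
usage_pred = [[0.5, 0.5], [0.0, 1.0]]
usage_true = [[0.5, 0.0], [0.0, 1.0]]
loss = combined_loss(site_logits, site_labels, usage_pred, usage_true)
print(loss)
```

Setting `usage_weight=0.0` leaves only the classification term, which can help when checking the two components separately.
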
## Citation

```bibtex
@article{You2024,
  author  = {You, Ningyuan and Liu, Chang and Gu, Yuxin and Wang, Rong and Jia, Hanying and Zhang, Tianyun and Jiang, Song and Shi, Jinsong and Chen, Ming and Guan, Min-Xin and Sun, Siqi and Pei, Shanshan and Liu, Zhihong and Shen, Ning},
  title   = {{SpliceTransformer predicts tissue-specific splicing linked to human diseases}},
  journal = {Nature Communications},
  year    = {2024},
  volume  = {15},
  number  = {1},
  pages   = {9129},
  month   = {oct},
  doi     = {10.1038/s41467-024-53088-6},
  issn    = {2041-1723},
  url     = {https://doi.org/10.1038/s41467-024-53088-6}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [SpliceTransformer paper](https://doi.org/10.1038/s41467-024-53088-6) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```