---
language: dna
tags:
  - Biology
  - DNA
  - RNA
  - Splicing
license: agpl-3.0
library_name: multimolecule
---

# MMSplice

Modular modeling of the effects of genetic variants on splicing.

## Disclaimer

This is an UNOFFICIAL implementation of [MMSplice: modular modeling improves the predictions of genetic variant effects on splicing](https://doi.org/10.1186/s13059-019-1653-z) by Jun Cheng, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, Žiga Avsec and Julien Gagneur.

The OFFICIAL repository of MMSplice is at [gagneurlab/MMSplice_MTSplice](https://github.com/gagneurlab/MMSplice_MTSplice).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team that released MMSplice did not write this model card; it was written by the MultiMolecule team.**

## Model Details

MMSplice is a _modular_ neural network for predicting the effect of genetic variants on pre-mRNA splicing. Instead of one monolithic network, MMSplice decomposes an exon together with its flanking introns into five regions and scores each region with an independent small convolutional sub-network:

- `acceptor_intron`: the intron stub upstream of the 3' splice site.
- `acceptor`: the 3' splice site (acceptor) region with a short exon flank.
- `exon`: the exon body.
- `donor`: the 5' splice site (donor) region with a short exon flank.
- `donor_intron`: the intron stub downstream of the 5' splice site.

Each sub-network is a stack of convolution blocks followed by a small dense head; it consumes a one-hot encoded DNA sequence and emits a single scalar score. The five scalar scores form the module score vector. For variant-effect estimation, the model is run on both the reference and the alternative sequence, and the per-module score deltas are combined by the fixed upstream linear model into a delta-logit-PSI splicing-effect score. Please refer to the [Training Details](#training-details) section for more information on the training process.
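
As a rough illustration of the one-hot input representation mentioned above, the following sketch encodes a DNA string with the standard library only. The `A, C, G, T` channel order and the all-zero row for unknown bases are assumptions made for this example; the checkpoints may use a different convention, and `one_hot` is not a `multimolecule` function.

```python
from __future__ import annotations

# Assumed channel order for illustration only; the real model may differ.
ALPHABET = "ACGT"


def one_hot(sequence: str) -> list[list[float]]:
    """Return a (length, 4) one-hot matrix as nested lists.

    Bases outside ALPHABET (for example N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows


encoded = one_hot("ACGTN")
```

In practice the tokenizer and model handle encoding internally, so a helper like this is only useful for understanding what the sub-networks see.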

### Variant Effect Interface

MMSplice exposes variant effects through its inputs rather than through a separate output type:

- Reference-only call (`input_ids` / `inputs_embeds`): returns the per-module score vector `logits` of shape `(batch_size, 5)`.
- Reference + alternative call (also pass `alternative_input_ids` / `alternative_inputs_embeds`): additionally returns `alternative_logits` and the per-module deltas `delta_logits` (`alternative_logits - logits`).
- `MMSpliceForSequencePrediction` requires a reference and alternative sequence and returns the upstream scalar delta-logit-PSI score with shape `(batch_size, 1)`.
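
The relationship between the three outputs can be sketched in plain Python. The scores and weights below are made-up placeholders, not values from the checkpoint, and `linear_combine` is a hypothetical stand-in for the upstream linear model, whose actual features and coefficients may differ.

```python
MODULES = ("acceptor_intron", "acceptor", "exon", "donor", "donor_intron")


def module_deltas(reference, alternative):
    """Per-module deltas, i.e. alternative_logits - logits."""
    return [alt - ref for ref, alt in zip(reference, alternative)]


def linear_combine(deltas, weights, bias=0.0):
    """Hypothetical linear combination of the per-module deltas."""
    return bias + sum(w * d for w, d in zip(weights, deltas))


# Made-up module scores for a reference and an alternative sequence.
reference_scores = [0.5, 1.0, -0.25, 2.0, 0.0]
alternative_scores = [0.5, 1.0, -0.75, 2.0, 0.0]

deltas = module_deltas(reference_scores, alternative_scores)

# Placeholder weights, NOT the fitted coefficients of the upstream model.
placeholder_weights = [1.0, 1.0, 1.0, 1.0, 1.0]
score = linear_combine(deltas, placeholder_weights)
```

Here only the `exon` module score changes, so the combined score equals that single delta under the placeholder weights.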

MMSplice inputs are exon sequences with 100 nt of upstream intronic context and 100 nt of downstream intronic context.
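
A small helper can make the expected input layout explicit. This is a sketch under the assumption that longer intronic context should be trimmed to the 100 nt adjacent to the exon; `build_input` is an illustrative name, not part of `multimolecule`.

```python
INTRON_FLANK = 100  # intronic context on each side, as stated above


def build_input(upstream_intron, exon, downstream_intron, flank=INTRON_FLANK):
    """Concatenate an exon with `flank` nt of intron on each side.

    Keeps the part of each intron that is adjacent to the exon.
    """
    if len(upstream_intron) < flank or len(downstream_intron) < flank:
        raise ValueError(f"each intron must provide at least {flank} nt")
    if not exon:
        raise ValueError("exon must not be empty")
    return upstream_intron[-flank:] + exon + downstream_intron[:flank]


sequence = build_input("A" * 120, "C" * 20, "G" * 130)
```

The resulting string can be passed to the tokenizer in the same way as the sequences in the usage examples.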

### Model Specification

| Num Modules | Num Parameters | FLOPs (M) | MACs (M) |
| ----------- | -------------- | --------- | -------- |
| 5           | 56,677         | 5.71      | 2.79     |

(FLOPs and MACs measured on a 220 bp exon-with-flanks input.)

### Links

- **Code**: [multimolecule.mmsplice](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/mmsplice)
- **Paper**: [MMSplice: modular modeling improves the predictions of genetic variant effects on splicing](https://doi.org/10.1186/s13059-019-1653-z)
- **Developed by**: Jun Cheng, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, Žiga Avsec, Julien Gagneur
- **Original Repository**: [gagneurlab/MMSplice_MTSplice](https://github.com/gagneurlab/MMSplice_MTSplice)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### Module Scores

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, MMSpliceForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/mmsplice")
>>> model = MMSpliceForSequencePrediction.from_pretrained("multimolecule/mmsplice")
>>> left_intron = "A" * 100
>>> exon = "C" * 20
>>> right_intron = "G" * 100
>>> reference = tokenizer(left_intron + exon + right_intron, return_tensors="pt")
>>> output = model.model(**reference)
>>> output["logits"].shape
torch.Size([1, 5])
```

#### Variant Effect

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, MMSpliceForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/mmsplice")
>>> model = MMSpliceForSequencePrediction.from_pretrained("multimolecule/mmsplice")
>>> left_intron = "A" * 100
>>> exon = "C" * 20
>>> right_intron = "G" * 100
>>> reference = tokenizer(left_intron + exon + right_intron, return_tensors="pt")
>>> alternative_exon = exon[:10] + "T" + exon[11:]
>>> alternative = tokenizer(left_intron + alternative_exon + right_intron, return_tensors="pt")
>>> output = model(
...     reference["input_ids"],
...     alternative_input_ids=alternative["input_ids"],
... )
>>> output["logits"].shape
torch.Size([1, 1])
```
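
The slicing used above to build the alternative exon can be wrapped in a small helper that also checks the reference base. This is an illustrative sketch for single-nucleotide substitutions only; `apply_snv` is a hypothetical name, and positions are assumed to be 0-based within the given string.

```python
def apply_snv(sequence, position, alternative_base, expected_reference=None):
    """Return `sequence` with the base at 0-based `position` replaced."""
    if len(alternative_base) != 1:
        raise ValueError("alternative_base must be a single character")
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    if expected_reference is not None and sequence[position] != expected_reference:
        raise ValueError("reference base does not match the sequence")
    return sequence[:position] + alternative_base + sequence[position + 1:]


exon = "C" * 20
alternative_exon = apply_snv(exon, 10, "T", expected_reference="C")
```

Checking the reference base guards against off-by-one errors when converting genomic coordinates to string indices.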

## Training Details

MMSplice was trained as five independent modules on splicing data and the modules were combined with a linear model to predict variant effects on percent-spliced-in (PSI).

### Training Data

The acceptor, donor, exon, and intron modules were trained on splice-site and exon data derived from human reference transcripts. The combining linear model was fit against a massively parallel reporter assay (MPRA) of exon-skipping variants.

### Training Procedure

#### Pre-training

Each module was trained with a sequence-to-scalar objective scoring its region. The module scores (and their reference/alternative deltas) were then combined by a fixed linear model into a delta-logit-PSI splicing-effect score.
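
Because the final score is a difference on the logit scale, converting it to a change in PSI requires a reference PSI value, which the model itself does not provide. The sketch below assumes a reference PSI is available from an external measurement; the function names are illustrative and not part of `multimolecule`.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def alternative_psi(reference_psi, delta_logit_psi):
    """Shift the reference PSI on the logit scale and map back to [0, 1]."""
    return sigmoid(logit(reference_psi) + delta_logit_psi)


# A negative delta-logit-PSI lowers the predicted exon inclusion.
psi_alt = alternative_psi(0.5, -1.0)
delta_psi = psi_alt - 0.5
```

The same delta-logit-PSI yields a smaller PSI change when the reference PSI is close to 0 or 1 than when it is near 0.5.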

## Citation

```bibtex
@article{cheng2019mmsplice,
  title     = {MMSplice: modular modeling improves the predictions of genetic variant effects on splicing},
  author    = {Cheng, Jun and Nguyen, Thi Yen Duong and Cygan, Kamil J and {\c{C}}elik, Muhammed Hasan and Fairbrother, William G and Avsec, {\v{Z}}iga and Gagneur, Julien},
  journal   = {Genome Biology},
  volume    = 20,
  number    = 1,
  pages     = {48},
  year      = 2019,
  publisher = {Springer},
  doi       = {10.1186/s13059-019-1653-z}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [MMSplice paper](https://doi.org/10.1186/s13059-019-1653-z) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```