---
language: dna
tags:
- Biology
- DNA
- RNA
- Splicing
license: agpl-3.0
library_name: multimolecule
---
# MMSplice
Modular modeling of the effects of genetic variants on splicing.
## Disclaimer
This is an UNOFFICIAL implementation of [MMSplice: modular modeling improves the predictions of genetic variant effects on splicing](https://doi.org/10.1186/s13059-019-1653-z) by Jun Cheng, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, Žiga Avsec and Julien Gagneur.
The OFFICIAL repository of MMSplice is at [gagneurlab/MMSplice_MTSplice](https://github.com/gagneurlab/MMSplice_MTSplice).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team that released MMSplice did not write a model card for it; this model card was written by the MultiMolecule team.**
## Model Details
MMSplice is a _modular_ neural network for predicting the effect of genetic variants on pre-mRNA splicing. Instead of one monolithic network, MMSplice decomposes an exon together with its flanking introns into five regions and scores each region with an independent small convolutional sub-network:
- `acceptor_intron`: the intron stub upstream of the 3' splice site.
- `acceptor`: the 3' splice site (acceptor) region with a short exon flank.
- `exon`: the exon body.
- `donor`: the 5' splice site (donor) region with a short exon flank.
- `donor_intron`: the intron stub downstream of the 5' splice site.
Each sub-network is a stack of convolution blocks followed by a small dense head: it consumes a one-hot encoded DNA sequence and emits a single scalar score. The five scalar scores form the module score vector. For variant-effect estimation, the model is run on both the reference and the alternative sequence, and the per-module score deltas are combined by the fixed upstream linear model into a delta-logit-PSI splicing-effect score. Please refer to the [Training Details](#training-details) section for more information on the training process.
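The delta-and-combine step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `module_deltas` and `combine` are hypothetical, the scores and weights are made-up numbers, and the real upstream linear model uses its own fixed coefficients (and possibly additional terms) that are not reproduced here.

```python
from typing import List, Sequence

# Module order follows the list above.
MODULES = ("acceptor_intron", "acceptor", "exon", "donor", "donor_intron")


def module_deltas(reference: Sequence[float], alternative: Sequence[float]) -> List[float]:
    """Per-module score deltas: alternative minus reference."""
    if len(reference) != len(MODULES) or len(alternative) != len(MODULES):
        raise ValueError(f"expected {len(MODULES)} scores per sequence")
    return [alt - ref for ref, alt in zip(reference, alternative)]


def combine(deltas: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Linear combination of the per-module deltas into one scalar."""
    return bias + sum(w * d for w, d in zip(weights, deltas))


# Hypothetical numbers for illustration only; NOT the upstream coefficients.
reference_scores = [0.5, 1.0, 2.0, 1.5, 0.25]
alternative_scores = [0.5, 1.0, 1.5, 1.5, 0.25]
weights = [1.0, 1.0, 2.0, 1.0, 1.0]

deltas = module_deltas(reference_scores, alternative_scores)
score = combine(deltas, weights)
print(dict(zip(MODULES, deltas)), score)
```

In this toy case only the `exon` module changes between reference and alternative, so the combined score depends solely on that module's delta and its weight.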
### Variant Effect Interface
MMSplice exposes variant effects as an input-schema concern, not a separate output type:
- Reference-only call (`input_ids` / `inputs_embeds`): returns the per-module score vector `logits` of shape `(batch_size, 5)`.
- Reference + alternative call (also pass `alternative_input_ids` / `alternative_inputs_embeds`): additionally returns `alternative_logits` and the per-module deltas `delta_logits` (`alternative_logits - logits`).
- `MMSpliceForSequencePrediction` requires both a reference and an alternative sequence and returns the upstream scalar delta-logit-PSI score with shape `(batch_size, 1)`.
MMSplice inputs are exon sequences with 100 nt of upstream intronic context and 100 nt of downstream intronic context.
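As a rough sketch of the input layout, the snippet below assembles an exon with 100 nt of intron on each side and one-hot encodes it. The helpers `build_input` and `one_hot` are hypothetical and not part of `multimolecule`; the `ACGT` channel order and the all-zero encoding of unknown bases are assumptions made for illustration. In practice the library's tokenizer and model handle the encoding.

```python
FLANK = 100  # intronic context on each side, as stated above
ALPHABET = "ACGT"  # channel order is an assumption for this sketch


def build_input(upstream_intron: str, exon: str, downstream_intron: str, flank: int = FLANK) -> str:
    """Concatenate an exon with exactly `flank` nt of intron on each side."""
    if len(upstream_intron) < flank or len(downstream_intron) < flank:
        raise ValueError(f"need at least {flank} nt of intron on each side")
    if not exon:
        raise ValueError("exon must not be empty")
    # Keep the intron bases closest to the exon.
    return upstream_intron[-flank:] + exon + downstream_intron[:flank]


def one_hot(sequence: str) -> list:
    """One-hot encode a DNA string; unknown bases (e.g. N) become all zeros (assumption)."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence.upper()]


sequence = build_input("A" * 150, "C" * 20, "G" * 120)
encoded = one_hot(sequence)
print(len(sequence), len(encoded), len(encoded[0]))
```

A 20 nt exon therefore yields a 220 nt input, which matches the input length quoted for the FLOPs and MACs measurement.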
### Model Specification
| Num Modules | Num Parameters | FLOPs (M) | MACs (M) |
| ----------- | -------------- | --------- | -------- |
| 5 | 56,677 | 5.71 | 2.79 |
(FLOPs and MACs measured on a 220 bp exon-with-flanks input.)
### Links
- **Code**: [multimolecule.mmsplice](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/mmsplice)
- **Paper**: [MMSplice: modular modeling improves the predictions of genetic variant effects on splicing](https://doi.org/10.1186/s13059-019-1653-z)
- **Developed by**: Jun Cheng, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, Žiga Avsec, Julien Gagneur
- **Original Repository**: [gagneurlab/MMSplice_MTSplice](https://github.com/gagneurlab/MMSplice_MTSplice)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Module Scores
```python
>>> import torch
>>> from multimolecule import DnaTokenizer, MMSpliceForSequencePrediction
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/mmsplice")
>>> model = MMSpliceForSequencePrediction.from_pretrained("multimolecule/mmsplice")
>>> left_intron = "A" * 100
>>> exon = "C" * 20
>>> right_intron = "G" * 100
>>> reference = tokenizer(left_intron + exon + right_intron, return_tensors="pt")
>>> output = model.model(**reference)
>>> output["logits"].shape
torch.Size([1, 5])
```
#### Variant Effect
```python
>>> import torch
>>> from multimolecule import DnaTokenizer, MMSpliceForSequencePrediction
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/mmsplice")
>>> model = MMSpliceForSequencePrediction.from_pretrained("multimolecule/mmsplice")
>>> left_intron = "A" * 100
>>> exon = "C" * 20
>>> right_intron = "G" * 100
>>> reference = tokenizer(left_intron + exon + right_intron, return_tensors="pt")
>>> alternative_exon = exon[:10] + "T" + exon[11:]
>>> alternative = tokenizer(left_intron + alternative_exon + right_intron, return_tensors="pt")
>>> output = model(
...     reference["input_ids"],
...     alternative_input_ids=alternative["input_ids"],
... )
>>> output["logits"].shape
torch.Size([1, 1])
```
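The alternative sequence above is built by slicing the exon string by hand. A small helper can make that step less error-prone. This is a hypothetical utility, not part of `multimolecule`; it uses 0-based positions and only handles single-nucleotide substitutions.

```python
from typing import Optional


def apply_substitution(sequence: str, position: int, alt: str, ref: Optional[str] = None) -> str:
    """Return `sequence` with the base at 0-based `position` replaced by `alt`."""
    if not 0 <= position < len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    if len(alt) != 1:
        raise ValueError("only single-nucleotide substitutions are handled")
    # Optionally check that the expected reference base is really there.
    if ref is not None and sequence[position] != ref:
        raise ValueError(f"expected {ref!r} at position {position}, found {sequence[position]!r}")
    return sequence[:position] + alt + sequence[position + 1:]


exon = "C" * 20
alternative_exon = apply_substitution(exon, 10, "T", ref="C")
print(alternative_exon)
```

Because a substitution keeps the sequence length unchanged, the reference and alternative inputs stay aligned position by position.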
## Training Details
MMSplice was trained as five independent modules on splicing data and the modules were combined with a linear model to predict variant effects on percent-spliced-in (PSI).
### Training Data
The acceptor, donor, exon, and two intron modules were trained on splice-site and exon data derived from human reference transcripts. The combining linear model was fit against a massively parallel reporter assay (MPRA) of exon-skipping variants.
### Training Procedure
#### Pre-training
Each module was trained with a sequence-to-scalar objective scoring its region. The module scores (and their reference/alternative deltas) were then combined by a fixed linear model into a delta-logit-PSI splicing-effect score.
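To interpret a delta-logit-PSI score, recall that it is the difference between the logit of the alternative PSI and the logit of the reference PSI. The sketch below converts a score into an alternative PSI under that definition. The function `alternative_psi` is hypothetical, and the reference PSI is an assumption the user must supply (for example from RNA-seq), since the model itself does not predict it.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def alternative_psi(reference_psi: float, delta_logit_psi: float) -> float:
    """Shift the reference PSI by a delta on the logit scale."""
    if not 0.0 < reference_psi < 1.0:
        raise ValueError("reference_psi must lie strictly between 0 and 1")
    return sigmoid(logit(reference_psi) + delta_logit_psi)


# Hypothetical example: an exon with reference PSI 0.5 and a score of -1.0.
psi_alt = alternative_psi(0.5, -1.0)
print(round(psi_alt, 3))
```

A negative score lowers the predicted exon inclusion relative to the reference, and a score of zero leaves it unchanged.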
## Citation
```bibtex
@article{cheng2019mmsplice,
title = {MMSplice: modular modeling improves the predictions of genetic variant effects on splicing},
author = {Cheng, Jun and Nguyen, Thi Yen Duong and Cygan, Kamil J and {\c{C}}elik, Muhammed Hasan and Fairbrother, William G and Avsec, {\v{Z}}iga and Gagneur, Julien},
journal = {Genome Biology},
volume = 20,
number = 1,
pages = {48},
year = 2019,
publisher = {Springer},
doi = {10.1186/s13059-019-1653-z}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [MMSplice paper](https://doi.org/10.1186/s13059-019-1653-z) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```