ZhiyuanChen's picture
Upload folder using huggingface_hub
77c143d verified
---
datasets:
- multimolecule/gencode
library_name: multimolecule
license: agpl-3.0
pipeline: splice-site
pipeline_tag: other
tags:
- Biology
- Genomics
- RNA
- Splicing
- rna
widget:
- example_title: microRNA 21
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UAGCUUAUCAGACUGAUGUUGA
- example_title: microRNA 146a
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UGAGAACUGAAUUCCAUGGGUU
- example_title: microRNA 155
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: UUAAUGCUAAUCGUGAUAGGGGUU
- example_title: RNA component of mitochondrial RNA processing endoribonuclease
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGUUCGUGCUGAAGGCCUGUAUCCUAGGCUACACACUGAGGACUCUGUUCCUCCCCUUUCCGCCUAGGGGAAAGUCCCCGGACCUCGGGCAGAGAGUGCCACGUGCAUACGCACGUAGACAUUCCCCGCUUCCCACUCCAAAGUCCGCCAAGAAGCGUAUCCCGCUGAGCGGCGUGGCGCGGGGGCGUCAUCCGUCAGCUCCCUCUAGUUACGCAGGCAGUGCGUGUCCGCGCACCAACCACACGGGGCUCAUUCUCAGCGCGGCUGUAAAAAAAAA
- example_title: 7SK small nuclear RNA
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGAUGUGAGGGCGAUCUGGCUGCGACAUCUGUCACCCCAUUGAUCGCCAGGGUUGAUUCGGCUGAUCUGGCUGGCUAGGCGGGUGUCCCCUUCCUCCCUCACCGCUCCAUGUGCGUCCCUCCCGAAGCUGCGCGCUCGGUCGAAGAGGACGACCAUCCCCGAUAGAGGAGGACCGGUCUUCGGUCAAGGGUAUACGAGUAGCUGCGCUCCCCUGCUAGAACCUCCAAACAAGCUCUCAAGGUCCAUUUGUAGGAGAACGUAGGGUAGUCAAGCUUCCAAGACUCCAGACACAUCCAAAUGAGGCGCUGCAUGUGGCAGUCUGCCUUUCUUUU
- example_title: telomerase RNA component
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGGUUGCGGAGGGUGGGCCUGGGAGGGGUGGUGGCCAUUUUUUGUCUAACCCUAACUGAGAAGGGCGUAGGCGCCGUGCUUUUGCUCCCCGCGCGCUGUUUUUCUCGCUGACUUUCAGCGGGCGGAAAAGCCUCGGCCUGCCGCCUUCCACCGUUCAUUCUAGAGCAAACAAAAAAUGUCAGCUGCUGGCCCGUUCGCCCCUCCCGGGGACCUGCGGCGGGUCGCCUGCCCAGCCCCCGAACCCCGCCUGGAGGCCGCGGUCGGCCCGGGGCUUCUCCGGAGGCACCCACUGCCACCGCGAAGAGUUGGGCUCUGUCAGCCGCGGGUCUCUCGGGGGCGAGGGCGAGGUUCAGGCCUUUCAGGCCGCAGGAAGAGGAACGGAGCGAGUCCCCGCGCGCGGCGCGAUUCCCUGAGCUGUGGGACGUGCACCCAGGACUCGGCUCACACAUGC
- example_title: vault RNA 2-1
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: CGGGUCGGAGUUAGCUCAAGCGGUUACCUCCUCAUGCCGGACUUUCUAUCUGUCCAUCUCUGUGCUGGGGUUCGAGACCCGCGGGUGCUUACUGACCCUUUUAUGCAA
- example_title: brain cytoplasmic RNA 1
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGCCGGGCGCGGUGGCUCACGCCUGUAAUCCCAGCUCUCAGGGAGGCUAAGAGGCGGGAGGAUAGCUUGAGCCCAGGAGUUCGAGACCUGCCUGGGCAAUAUAGCGAGACCCCGUUCUCCAGAAAAAGGAAAAAAAAAAACAAAAGACAAAAAAAAAAUAAGCGUAACUUCCCUCAAAGCAACAACCCCCCCCCCCCUUU
- example_title: HIV-1 TAR-WT
pipeline_tag: splice-site
sequence_type: ncRNA
task: splice-site
text: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACC
- example_title: prion protein (Kanno blood group)
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGGCGAACCUUGGCUGCUGGAUGCUGGUUCUCUUUGUGGCCACAUGGAGUGACCUGGGCCUCUGC
- example_title: interleukin 10
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGCACAGCUCAGCACUGCUCUGUUGCCUGGUCCUCCUGACUGGGGUGAGGGCC
- example_title: Zaire ebolavirus
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AAUGUUCAAACACUUUGUGAAGCUCUGUUAGCUGAUGGUCUUGCUAAAGCAUUUCCUAGCAAUAUGAUGGUAGUCACAGAGCGUGAGCAAAAAGAAAGCUUAUUGCAUCAAGCAUCAUGGCACCACACAAGUGAUGAUUUUGGUGAGCAUGCCACAGUUAGAGGGAGUAGCUUUGUAACUGAUUUAGAGAAAUACAAUCUUGCAUUUAGAUAUGAGUUUACAGCACCUUUUAUAGAAUAUUGUAACCGUUGCUAUGGUGUUAAGAAUGUUUUUAAUUGGAUGCAUUAUACAAUCCCACAGUGUUAU
- example_title: SARS coronavirus
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGUUUAUUUUCUUAUUAUUUCUUACUCUCACUAGUGGUAGUGACCUUGACCGGUGCACCACUUUUGAUGAUGUUCAAGCUCCUAAUUACACUCAACAUACUUCAUCUAUGAGGGGGGUUUACUAUCCUGAUGAAAUUUUUAGAUCAGACACUCUUUAUUUAACUCAGGAUUUAUUUCUUCCAUUUUAUUCUAAUGUUACAGGGUUUCAUACUAUUAAUCAUACGUUUGACAACCCUGUCAUACCUUUUAAGGAUGGUAUUUAUUUUGCUGCCACAGAGAAAUCAAAUGUUGUCCGUGGUUGGGUUUUUGGUUCUACCAUGAACAACAAGUCACAGUCGGUGAUUAUUAUUAACAAUUCUACUAAUGUUGUUAUACGAGCAUGUAACUUUGAAUUGUGUGACAACCCUUUCUUUGCUGUUUCUAAACCCAUGGGUACACAGACACAUACUAUGAUAUUCGAUAAUGCAUUUAAAUGCACUUUCGAGUACAUAUCU
- example_title: insulin
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGGCCCUGUGGAUGCGCCUCCUGCCCCUGCUGGCGCUGCUGGCCCUCUGGGGACCUGACCCAGCCGCAGCCUUUGUGAACCAACACCUGUGCGGCUCACACCUGGUGGAAGCUCUCUACCUAGUGUGCGGGGAACGAGGCUUCUUCUACACACCCAAGACCCGCCGGGAGGCAGAGGACCUGCAGGUGGGGCAGGUGGAGCUGGGCGGGGGCCCUGGUGCAGGCAGCCUGCAGCCCUUGGCCCUGGAGGGGUCCCUGCAGAAGCGUGGCAUUGUGGAACAAUGCUGUACCAGCAUCUGCUCCCUCUACCAGCUGGAGAACUACUGCAACUAG
- example_title: cyclin dependent kinase inhibitor 2A
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGGAGCCGGCGGCGGGGAGCAGCAUGGAGCCUUCGGCUGACUGGCUGGCCACGGCCGCGGCCCGGGGUCGGGUAGAGGAGGUGCGGGCGCUGCUGGAGGCGGGGGCGCUGCCCAACGCACCGAAUAGUUACGGUCGGAGGCCGAUCCAGGUCAUGAUGAUGGGCAGCGCCCGAGUGGCGGAGCUGCUGCUGCUCCACGGCGCGGAGCCCAACUGCGCCGACCCCGCCACUCUCACCCGACCCGUGCACGACGCUGCCCGGGAGGGCUUCCUGGACACGCUGGUGGUGCUGCACCGGGCCGGGGCGCGGCUGGACGUGCGCGAUGCCUGGGGCCGUCUGCCCGUGGACCUGGCUGAGGAGCUGGGCCAUCGCGAUGUCGCACGGUACCUGCGCGCGGCUGCGGGGGGCACCAGAGGCAGUAACCAUGCCCGCAUAGAUGCCGCGGAAGGUCCCUCAGACAUCCCCGAUUGA
- example_title: human papillomavirus type 16 E6
pipeline_tag: splice-site
sequence_type: mRNA
task: splice-site
text: AUGCACCAAAAGAGAACUGCAAUGUUUCAGGACCCACAGGAGCGACCCAGAAAGUUACCACAGUUAUGCACAGAGCUGCAAACAACUAUACAUGAUAUAAUAUUAGAAUGUGUGUACUGCAAGCAACAGUUACUGCGACGUGAGGUAUAUGACUUUGCUUUUCGGGAUUUAUGCAUAGUAUAUAGAGAUGGGAAUCCAUAUGCUGUAUGUGAUAAAUGUUUAAAGUUUUAUUCUAAAAUUAGUGAGUAUAGACAUUAUUGUUAUAGUUUGUAUGGAACAACAUUAGAACAGCAAUACAACAAACCGUUGUGUGAUUUGUUAAUUAGGUGUAUUAACUGUCAAAAGCCACUGUGUCCUGAAGAAAAGCAAAGACAUCUGGACAAAAAGCAAAGAUUCCAUAAUAUAAGGGGUCGGUGGACCGGUCGAUGUAUGUCUUGUUGCAGAUCAUCAAGAACACGUAGAGAAACCCAGCUGUAA
- example_title: NRAS proto-oncogene
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: GGGGCCGGAAGUGCCGCUCCUUGGUGGGGGCUGUUCAUGGCGGUUCCGGGGUCUCCAACAUUUUUCCCGGCUGUGGUCCUAAAUCUGUCCAAAGCAGAGGCAGUGGAGCUUGAGGUUCUUGCUGGUGUGAA
- example_title: amyloid beta precursor protein
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: GUCAGUUUCCUCGGCAGCGGUAGGCGAGAGCACGCGGAGGAGCGUGCGCGGGGGCCCCGGGAGACGGCGGCGGUGGCGGCGCGGGCAGAGCAAGGACGCGGCGGAUCCCACUCGCACAGCAGCGCACUCGGUGCCCCGCGCAGGGUCGCG
- example_title: RUNX family transcription factor 1
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: ACUUCUUUGGGCCUCAUAAACAACCACAGAACCACAAGUUGGGUAGCCUGGCAGUGUCAGAAGUCUGAACCCAGCAUAGUGGUCAGCAGGCAGGACGAAUCACACUGAAUGCAAACCACAGGGUUUCGCAGCGUGGUAAAAGAAAUCAUUGAGUCCCCCGCCUUCAGAAGAGGGUGCAUUUUCAGGAGGAAGCG
- example_title: fragile X messenger ribonucleoprotein 1
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: CUCAGUCAGGCGCUCAGCUCCGUUUCGGUUUCACUUCCGGUGGAGGGCCGCCUCUGAGCGGGCGGCGGGCCGACGGCGAGCGCGGGCGGCGGCGGUGACGGAGGCGCCGCUGCCAGGGGGCGUGCGGCAGCGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGAGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCUGGGCCUCGAGCGCCCGCAGCCCACCUCUCGGGGGCGGGCUCCCGGCGCUAGCAGGGCUGAAGAGAAG
- example_title: MYC proto-oncogene
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: AACUCGCUGUAGUAAUUCCAGCGAGAGGCAGAGGGAGCGAGCGGGCGGCCGGCUAGGGUGGAAGAGCCGGGCGAGCAGAGCUGCGCUGCGGGCGUCCUGGGAAGGGAGAUCCGGAGCGAAUAGGGGGCUUCGCCUCUGGCCCAGCCCUCCCGCUGAUCCCCCAGCCAGCGGUCCGCAACCCUUGCCGCAUCCACGAAACUUUGCCCAUAGCAGCGGGCGGGCACUUUGCACUGGAACUUACAACACCCGAGCAAGGACGCGACUCUCCCGACGCGGGGAGGCUAUUCUGCCCAUUUGGGGACACUUCCCCGCCGCUGCCAGGACCCGCUUCUCUGAAAGGCUCUCCUUGCAGCUGCUUAGACG
- example_title: activating transcription factor 4
pipeline_tag: splice-site
sequence_type: 5' UTR
task: splice-site
text: CAUUUCUACUUUGCCCGCCCACAGAUGUAGUUUUCUCUGCGCGUGUGCGUUUUCCCUCCUCCCCGCCCUCAGGGUCCACGGCCACCAUGGCGUAUUAGGGGCAGCAGUGCCUGCGGCAGCAUUGGCCUUUGCAGCGGCGGCAGCAGCACCAGGCUCUGCAGCGGCAACCCCCAGCGGCUUAAGCCAUGGCGCUUCUCACGGCAUUCAGCAGCAGCGUUGCUGUAACCGACAAAGACACCUUCGAAUUAAGCACAUUCCUCGAUUCCAGCAAAGCACCGCAAC
- example_title: Human GPI protein p137
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: UUUUUAAAAGGAAAAGAUACCAAAUGCCUGCUGCUACCACCCUUUUCAAUUGCUAUGUUUUGAAAGGCACCAGUAUGUGUUUUAGAUUGAUUUAAAUGUUUCAUUUAAAUCACGGACAGUAGUUUCAGUUCUGAUGGUAUAAGCAAAACAAAUAAAACGUUUAUAAAAGUUGUAUCUUGAAACACUGGUGUUCAACAGCUAGCAGCUUAUGUGAUUCACCCCAUGCCACGUUAGUGUCACAAAUUUUAUGGUUUAUCUCCAGCAACAUUUCUCUAGUACUUGCACUUAUUAUCUGAAUUC
- example_title: nucleophosmin 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: GAAAAUAGUUUAAACAAUUUGUUAAAAAAUUUUCCGUCUUAUUUCAUUUCUGUAACAGUUGAUAUCUGGCUGUCCUUUUUAUAAUGCAGAGUGAGAACUUUCCCUACCGUGUUUGAUAAAUGUUGUCCAGGUUCUAUUGCCAAGAAUGUGUUGUCCAAAAUGCCUGUUUAGUUUUUAAAGAUGGAACUCCACCCUUUGCUUGGUUUUAAGUAUGUAUGGAAUGUUAUGAUAGGACAUAGUAGUAGCGGUGGUCAGACAUGGAAAUGGUGGGGAGACAAAAAUAUACAUGUGAAAUAAAACUCAGUAUUUUAAUAAAGUAGCACGGUUUCUAUUGA
- example_title: superoxide dismutase 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: ACAUUCCCUUGGAUGUAGUCUGAGGCCCCUUAACUCAUCUGUUAUCCUGCUAGCUGUAGAAAUGUAUCCUGAUAAACAUUAAACACUGUAAUCUUAAAAGUGUAAUUGUGUGACUUUUUCAGAGUUGCUUUAAAGUACCUGUAGUGAGAAACUGAUUUAUGAUCACUUGGAAGAUUUGUAUAGUUUUAUAAAACUCAGUUAAAAUGUCUGUUUCAAUGACCUGUAUUUUGCCAGACUUAAAUCACAGAUGGGUAUUAAACUUGUCAGAAUUUCUUUGUCAUUCAAGCCUGUGAAUAAAAACCCUGUAUGGCACUUAUUAUGAGGCUAUUAAAAGAAUCCAAAUUCAAACUAAA
- example_title: hemoglobin subunit alpha 2
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: CUGGAGCCUCGGUAGCCGUUCCUCCUGCCCGCUGGGCCUCCCAACGGGCCCUCCUCCCCUCCUUGCACCGGCCCUUCCUGGUCUUUGAAUAAAGUCUGAGUGGGCAGCA
- example_title: BRAF proto-oncogene
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: AACAAAUGAGUGAGAGAGUUCAGGAGAGUAGCAACAAAAGGAAAAUAAAUGAACAUAUGUUUGCUUAUAUGUUAAAUUGAAUAAAAUACUCUCUUUUUUUUUAAGGUGAACCAAAGAACACUUGUGUGGUUAAAGACUAGAUAUAAUUUUUCCCCAAACUAAAAUUUAUACUUAACAUUGGAUUUUUAACAUCCAAGGGUUAAAAUACAUAGACAUUGCUAAAAAUUGGCAGAGCCUCUUCUAGAGGCUUUACUUUCUGUUCCGGGUUUGUAUCAUUCACUUGGUUAUUUUAAGUAGUAAACUUCAGUUUCUCAUGCAACUUUUGUUGCCAGCUAUCACAUGUCCACUAGGGACUCCAGAAGAAGACCCUACCUAUGCCUGUGUUUGCAGGUGAGAAGUUGGCAGUCGGUUAGCCUGGG
- example_title: H3 clustered histone 1
pipeline_tag: splice-site
sequence_type: 3' UTR
task: splice-site
text: UUACUGUGGUCUCUCUGACGGUCCAAGCAAAGGCUCUUUUCAGAGCCACCACCUUUUC
---
# OpenSpliceAI
Modular native-PyTorch reimplementation of SpliceAI for predicting pre-mRNA splice sites from nucleotide sequence.
## Disclaimer
This is an UNOFFICIAL implementation of [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3) by Kuan-Hao Chao, Alan Mao, et al.
The OFFICIAL repository of OpenSpliceAI is at [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
**The team releasing OpenSpliceAI did not write this model card for this model so this model card has been written by the MultiMolecule team.**
## Model Details
OpenSpliceAI is a deep dilated residual convolutional neural network that reimplements the SpliceAI architecture in native PyTorch. It predicts, for each nucleotide of a pre-mRNA transcript, whether the position is a splice acceptor, a splice donor, or neither. The model stacks dilated residual units with increasing kernel size and atrous rate so that a wide genomic context window contributes to each per-nucleotide prediction, while skip connections aggregate multi-scale features. OpenSpliceAI reproduces the predictive behavior of SpliceAI while providing an efficient, modular training pipeline that can be retrained on non-human species.
### Variants
OpenSpliceAI ships trained model families for human MANE and four non-human species. Each family provides four
flanking-context sizes.
| Family | 80 nt | 400 nt | 2,000 nt | 10,000 nt |
| ------------- | ------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| MANE / human | [`openspliceai-mane.80`](https://huggingface.co/multimolecule/openspliceai-mane.80) | [`openspliceai-mane.400`](https://huggingface.co/multimolecule/openspliceai-mane.400) | [`openspliceai-mane.2000`](https://huggingface.co/multimolecule/openspliceai-mane.2000) | [`openspliceai-mane.10000`](https://huggingface.co/multimolecule/openspliceai-mane.10000) |
| Mouse | [`openspliceai-mouse.80`](https://huggingface.co/multimolecule/openspliceai-mouse.80) | [`openspliceai-mouse.400`](https://huggingface.co/multimolecule/openspliceai-mouse.400) | [`openspliceai-mouse.2000`](https://huggingface.co/multimolecule/openspliceai-mouse.2000) | [`openspliceai-mouse.10000`](https://huggingface.co/multimolecule/openspliceai-mouse.10000) |
| Zebrafish | [`openspliceai-zebrafish.80`](https://huggingface.co/multimolecule/openspliceai-zebrafish.80) | [`openspliceai-zebrafish.400`](https://huggingface.co/multimolecule/openspliceai-zebrafish.400) | [`openspliceai-zebrafish.2000`](https://huggingface.co/multimolecule/openspliceai-zebrafish.2000) | [`openspliceai-zebrafish.10000`](https://huggingface.co/multimolecule/openspliceai-zebrafish.10000) |
| Honeybee | [`openspliceai-honeybee.80`](https://huggingface.co/multimolecule/openspliceai-honeybee.80) | [`openspliceai-honeybee.400`](https://huggingface.co/multimolecule/openspliceai-honeybee.400) | [`openspliceai-honeybee.2000`](https://huggingface.co/multimolecule/openspliceai-honeybee.2000) | [`openspliceai-honeybee.10000`](https://huggingface.co/multimolecule/openspliceai-honeybee.10000) |
| _Arabidopsis_ | [`openspliceai-arabidopsis.80`](https://huggingface.co/multimolecule/openspliceai-arabidopsis.80) | [`openspliceai-arabidopsis.400`](https://huggingface.co/multimolecule/openspliceai-arabidopsis.400) | [`openspliceai-arabidopsis.2000`](https://huggingface.co/multimolecule/openspliceai-arabidopsis.2000) | [`openspliceai-arabidopsis.10000`](https://huggingface.co/multimolecule/openspliceai-arabidopsis.10000) |
### Model Specification
<table>
<thead>
<tr>
<th>Flanking Context</th>
<th>Residual Blocks</th>
<th>Hidden Size</th>
<th>Num Parameters (M)</th>
<th>FLOPs (G)</th>
<th>MACs (G)</th>
</tr>
</thead>
<tbody>
<tr>
<td>80 nt</td>
<td>4</td>
<td rowspan="4">32</td>
<td>0.09</td>
<td>0.95</td>
<td>0.47</td>
</tr>
<tr>
<td>400 nt</td>
<td>8</td>
<td>0.19</td>
<td>2.00</td>
<td>0.99</td>
</tr>
<tr>
<td>2,000 nt</td>
<td>12</td>
<td>0.36</td>
<td>5.03</td>
<td>2.50</td>
</tr>
<tr>
<td>10,000 nt</td>
<td>16</td>
<td>0.70</td>
<td>20.90</td>
<td>10.40</td>
</tr>
</tbody>
</table>
Model size is determined by flanking context and is shared across species for the same context. FLOPs and MACs are
reported for a single 5,000-nucleotide output sequence.
### Links
- **Code**: [multimolecule.openspliceai](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/openspliceai)
- **Data**: Human MANE/GENCODE for the MANE variants; species annotations follow the original OpenSpliceAI release.
- **Paper**: [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3)
- **Developed by**: Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg, Mihaela Pertea
- **Model type**: Dilated residual 1D CNN over pre-mRNA sequence for per-nucleotide three-class splice-site classification
- **Original Repository**: [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### RNA Splicing Site Prediction
You can use this model directly to predict the splice sites of a pre-mRNA sequence:
```python
>>> from multimolecule import RnaTokenizer, OpenSpliceAiForTokenPrediction
>>> model_id = "multimolecule/openspliceai-mouse.2000"
>>> tokenizer = RnaTokenizer.from_pretrained(model_id)
>>> model = OpenSpliceAiForTokenPrediction.from_pretrained(model_id)
>>> output = model(tokenizer("AGCAGUCAUUAUGGCGAA", return_tensors="pt")["input_ids"])
>>> output.keys()
odict_keys(['logits'])
```
Each output position carries three logits corresponding to _neither_, _acceptor_, and _donor_.
### Interface
- **Input length**: variable pre-mRNA sequence
- **Flanking context**: 80 / 400 / 2,000 / 10,000 nt per variant family, split evenly on both sides of every predicted position
- **Padding**: sequence ends padded with `N`
- **Output**: per-position 3-class logits (`neither`, `acceptor`, `donor`)
## Training Details
OpenSpliceAI was trained to predict the location of splice donor and acceptor sites from nucleotide sequence, following the SpliceAI training methodology.
### Training Data
The MANE variants were trained on transcripts from the [GENCODE](https://multimolecule.danling.org/datasets/gencode)/MANE human reference annotation. The non-human variants use the species annotations released by OpenSpliceAI for mouse, zebrafish, honeybee, and _Arabidopsis_. For each predicted nucleotide, the model receives a flanking context of 80, 400, 2,000, or 10,000 nucleotides, split evenly across the two sides of the output sequence, with sequence ends padded with `N`. Annotated splice donor and acceptor sites serve as positive labels; all other positions are negative.
### Training Procedure
#### Pre-training
The model was trained to minimize a cross-entropy loss between predicted splice-site probabilities and the reference annotation.
- Optimizer: Adam
- Loss: cross-entropy
Please refer to the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for the full training protocol and hardware details.
## Citation
```bibtex
@article{chao2025openspliceai,
author = {Chao, Kuan-Hao and Mao, Alan and Liu, Anqi and Salzberg, Steven L and Pertea, Mihaela},
title = {OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species},
journal = {eLife},
volume = 14,
pages = {RP107454},
year = 2025,
doi = {10.7554/eLife.107454.3},
publisher = {eLife Sciences Publications, Ltd}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If MultiMolecule supports your research, please cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```