---
language: dna
tags:
  - Biology
  - DNA
  - Genomics
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# OpenSpliceAI

Modular native-PyTorch reimplementation of SpliceAI for predicting pre-mRNA splice sites from primary DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3) by Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg, and Mihaela Pertea.

The OFFICIAL repository of OpenSpliceAI is at [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing OpenSpliceAI did not write this model card, so it has been written by the MultiMolecule team.**

## Model Details

OpenSpliceAI is a deep dilated residual convolutional neural network that reimplements the SpliceAI architecture in native PyTorch. It predicts, for each nucleotide of a pre-mRNA transcript, whether the position is a splice acceptor, a splice donor, or neither. The model stacks dilated residual units with increasing kernel size and atrous rate so that a wide genomic context window contributes to each per-nucleotide prediction, while skip connections aggregate multi-scale features. OpenSpliceAI reproduces the predictive behavior of SpliceAI while providing an efficient, modular training pipeline that can be retrained on non-human species.

### Variations

OpenSpliceAI ships trained model families for human MANE and four non-human species. Each family provides four
flanking-context sizes. MultiMolecule publishes one seed (`rs10`) for each family/context pair; the other upstream seed
checkpoints are training replicates and are not exposed as separate model variants.

| Family | 80 nt | 400 nt | 2,000 nt | 10,000 nt |
| ------ | ----- | ------ | -------- | --------- |
| MANE / human | [`openspliceai-mane-80nt`](https://huggingface.co/multimolecule/openspliceai-mane-80nt) | [`openspliceai-mane-400nt`](https://huggingface.co/multimolecule/openspliceai-mane-400nt) | [`openspliceai-mane-2000nt`](https://huggingface.co/multimolecule/openspliceai-mane-2000nt) | [`openspliceai-mane-10000nt`](https://huggingface.co/multimolecule/openspliceai-mane-10000nt) |
| Mouse | [`openspliceai-mouse-80nt`](https://huggingface.co/multimolecule/openspliceai-mouse-80nt) | [`openspliceai-mouse-400nt`](https://huggingface.co/multimolecule/openspliceai-mouse-400nt) | [`openspliceai-mouse-2000nt`](https://huggingface.co/multimolecule/openspliceai-mouse-2000nt) | [`openspliceai-mouse-10000nt`](https://huggingface.co/multimolecule/openspliceai-mouse-10000nt) |
| Zebrafish | [`openspliceai-zebrafish-80nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-80nt) | [`openspliceai-zebrafish-400nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-400nt) | [`openspliceai-zebrafish-2000nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-2000nt) | [`openspliceai-zebrafish-10000nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-10000nt) |
| Honeybee | [`openspliceai-honeybee-80nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-80nt) | [`openspliceai-honeybee-400nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-400nt) | [`openspliceai-honeybee-2000nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-2000nt) | [`openspliceai-honeybee-10000nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-10000nt) |
| _Arabidopsis_ | [`openspliceai-arabidopsis-80nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-80nt) | [`openspliceai-arabidopsis-400nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-400nt) | [`openspliceai-arabidopsis-2000nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-2000nt) | [`openspliceai-arabidopsis-10000nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-10000nt) |

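
The repository names in the table follow a regular `openspliceai-<family>-<context>nt` pattern, so the full set can be enumerated programmatically. The sketch below is illustrative only: `FAMILIES`, `CONTEXTS`, and `variant_ids` are hypothetical names introduced here, not part of the MultiMolecule API, and the pattern is inferred from the table rather than documented elsewhere.

```python
from itertools import product

# Family and context labels copied from the table above.
FAMILIES = ["mane", "mouse", "zebrafish", "honeybee", "arabidopsis"]
CONTEXTS = [80, 400, 2000, 10000]


def variant_ids(namespace: str = "multimolecule") -> list[str]:
    """Return every family/context repository ID, in table order."""
    return [
        f"{namespace}/openspliceai-{family}-{context}nt"
        for family, context in product(FAMILIES, CONTEXTS)
    ]


ids = variant_ids()
print(len(ids))  # 20
print(ids[6])    # multimolecule/openspliceai-mouse-2000nt
```

Any of these IDs can be passed to `from_pretrained` in place of the mouse 2,000 nt checkpoint used elsewhere in this card.
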
### Model Specification

| Flanking Context | Residual Blocks | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ---------------- | --------------- | ----------- | ------------------ | --------- | -------- |
| 80 nt            | 4               | 32          | 0.09               | 0.95      | 0.47     |
| 400 nt           | 8               | 32          | 0.19               | 2.00      | 0.99     |
| 2,000 nt         | 12              | 32          | 0.36               | 5.03      | 2.50     |
| 10,000 nt        | 16              | 32          | 0.70               | 20.90     | 10.40    |

Model size is determined by flanking context and is shared across species for the same context. FLOPs and MACs are
reported for a single 5,000-nucleotide output sequence.

### Links

- **Code**: [multimolecule.openspliceai](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/openspliceai)
- **Weights**: See the 20 variant repositories listed above.
- **Data**: Human MANE/GENCODE for the MANE variants; species annotations follow the original OpenSpliceAI release.
- **Paper**: [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3)
- **Developed by**: Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg, Mihaela Pertea
- **Original Repository**: [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI) (licensed GPL-3.0)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### RNA Splice Site Prediction

You can use this model directly to predict the splice sites of a pre-mRNA sequence:

```python
>>> from multimolecule import DnaTokenizer, OpenSpliceAiForTokenPrediction

>>> model_id = "multimolecule/openspliceai-mouse-2000nt"
>>> tokenizer = DnaTokenizer.from_pretrained(model_id)
>>> model = OpenSpliceAiForTokenPrediction.from_pretrained(model_id)
>>> # Tokenize the DNA sequence and run a forward pass on the token IDs.
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])
>>> output.keys()
odict_keys(['logits'])
```

Each output position carries three logits corresponding to _neither_, _acceptor_, and _donor_.

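
To turn the three per-position logits into splice-site calls, a softmax followed by a threshold is one common choice. The dependency-free sketch below assumes the class order stated above (_neither_, _acceptor_, _donor_); the `softmax` and `call_sites` helpers, the 0.5 threshold, and the toy logits are all hypothetical and chosen for illustration, not taken from MultiMolecule or the OpenSpliceAI paper.

```python
import math

# Class order as stated above: neither, acceptor, donor.
LABELS = ("neither", "acceptor", "donor")


def softmax(logits):
    """Numerically stable softmax over one position's three logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_sites(position_logits, threshold=0.5):
    """Return (position, label, probability) for acceptor/donor calls.

    The threshold is an arbitrary illustrative value, not a recommended one.
    """
    calls = []
    for pos, logits in enumerate(position_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if LABELS[best] != "neither" and probs[best] >= threshold:
            calls.append((pos, LABELS[best], probs[best]))
    return calls


# Made-up logits for three positions, for illustration only.
toy_logits = [[4.0, -2.0, -3.0], [-1.0, 3.0, -2.0], [0.0, 0.0, 0.0]]
for pos, label, prob in call_sites(toy_logits):
    print(pos, label, round(prob, 3))
```

With real model output, the same normalization would typically be applied along the last dimension of the logits tensor, for example with `torch.softmax(output["logits"], dim=-1)`, assuming the class axis is last.
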
## Training Details

OpenSpliceAI was trained to predict the location of splice donor and acceptor sites from primary DNA sequence, following the SpliceAI training methodology.

### Training Data

The MANE variants were trained on transcripts from the [GENCODE](https://multimolecule.danling.org/datasets/gencode)/MANE human reference annotation. The non-human variants use the species annotations released by OpenSpliceAI for mouse, zebrafish, honeybee, and _Arabidopsis_. For each predicted nucleotide, the model receives a flanking context of 80, 400, 2,000, or 10,000 nucleotides, split evenly across the two sides of the output sequence, with sequence ends padded with `N`. Annotated splice donor and acceptor sites serve as positive labels; all other positions are negative.

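
The flanking-context arithmetic described above can be sketched in a few lines: half of the context goes on each side of the output sequence, and `N` fills positions beyond the sequence ends. The `pad_flanks` helper is hypothetical and only illustrates the length bookkeeping; whether the MultiMolecule model expects pre-padded input or pads internally is not stated in this card.

```python
def pad_flanks(sequence: str, context: int) -> str:
    """Pad a sequence with context // 2 'N' characters on each side."""
    if context < 0 or context % 2:
        raise ValueError("context must be a non-negative even number")
    flank = "N" * (context // 2)
    return flank + sequence + flank


sequence = "AGCAGTCATTATGGCGAA"  # 18 nt, the example from the usage section
padded = pad_flanks(sequence, context=2000)
print(len(padded))  # 18 + 2000 = 2018
```

By the same arithmetic, a 5,000-nucleotide output sequence with 10,000 nt of flanking context corresponds to a 15,000-nucleotide input window.
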
### Training Procedure

#### Pre-training

The model was trained to minimize a cross-entropy loss between predicted splice-site probabilities and the reference annotation.

- Architecture: dilated residual 1D convolutions with skip connections
- Activation: LeakyReLU (slope 0.1)
- Optimizer: Adam
- Loss: cross-entropy
- Flanking context sizes: 80 / 400 / 2,000 / 10,000 nucleotides

Please refer to the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for the full training protocol and hardware details.

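
For reference, a generic per-position categorical cross-entropy over the three classes can be written as follows. This is the textbook form, given as an assumption: the symbols ($L$ for the output length, $z_{i,c}$ for the logit, $p_{i,c}$ for the predicted probability, $y_{i,c}$ for the one-hot label) are introduced here, and the exact reduction or any class weighting used by the authors is described in the paper, not in this card.

```latex
p_{i,c} = \frac{\exp(z_{i,c})}{\sum_{c'} \exp(z_{i,c'})},
\qquad
\mathcal{L} = -\frac{1}{L} \sum_{i=1}^{L}
  \sum_{c \in \{\text{neither},\, \text{acceptor},\, \text{donor}\}}
  y_{i,c} \, \log p_{i,c}
```

Because each $y_{i,\cdot}$ is one-hot, the inner sum reduces to the negative log-probability assigned to the annotated class at position $i$.
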
## Citation

```bibtex
@article{chao2025openspliceai,
  author    = {Chao, Kuan-Hao and Mao, Alan and Liu, Anqi and Salzberg, Steven L and Pertea, Mihaela},
  title     = {OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species},
  journal   = {eLife},
  volume    = 14,
  pages     = {RP107454},
  year      = 2025,
  doi       = {10.7554/eLife.107454.3},
  publisher = {eLife Sciences Publications, Ltd}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) GitHub issues for any questions or comments on the model card.

Please contact the authors of the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```