OpenSpliceAI

Modular native-PyTorch reimplementation of SpliceAI for predicting pre-mRNA splice sites from primary DNA sequence.

Disclaimer

This is an UNOFFICIAL implementation of OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species by Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg and Mihaela Pertea.

The OFFICIAL repository of OpenSpliceAI is at Kuanhao-Chao/OpenSpliceAI.

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team releasing OpenSpliceAI did not write a model card for this model, so this model card has been written by the MultiMolecule team.

Model Details

OpenSpliceAI is a deep dilated residual convolutional neural network that reimplements the SpliceAI architecture in native PyTorch. It predicts, for each nucleotide of a pre-mRNA transcript, whether the position is a splice acceptor, a splice donor, or neither. The model stacks dilated residual units with increasing kernel size and atrous rate so that a wide genomic context window contributes to each per-nucleotide prediction, while skip connections aggregate multi-scale features. OpenSpliceAI reproduces the predictive behavior of SpliceAI while providing an efficient, modular training pipeline that can be retrained on non-human species.
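To make the link between dilation and context width concrete, the sketch below adds up how much each dilated convolution widens the receptive field. The kernel sizes and dilation rates are assumptions taken from the original SpliceAI design (groups of four residual units, two convolutions per unit); this model card does not list them, so treat the numbers as illustrative rather than as a description of the released checkpoints.

```python
# Assumed (kernel_size, dilation) per group of four residual units, following
# the original SpliceAI design. These values are NOT stated in this model card.
ASSUMED_GROUPS = [(11, 1), (11, 4), (21, 10), (41, 25)]
BLOCKS_PER_GROUP = 4
CONVS_PER_BLOCK = 2  # assumed: each residual unit holds two convolutions


def flanking_context(num_groups: int) -> int:
    """Total flanking context (both sides) covered by the first `num_groups` groups."""
    context = 0
    for kernel_size, dilation in ASSUMED_GROUPS[:num_groups]:
        # One dilated convolution widens the receptive field by (kernel_size - 1) * dilation.
        context += BLOCKS_PER_GROUP * CONVS_PER_BLOCK * (kernel_size - 1) * dilation
    return context


for groups in range(1, len(ASSUMED_GROUPS) + 1):
    blocks = groups * BLOCKS_PER_GROUP
    print(f"{blocks} residual blocks -> {flanking_context(groups)} nt of context")
```

Under these assumptions, 4, 8, 12, and 16 residual blocks cover 80, 400, 2,000, and 10,000 nucleotides of context, which lines up with the block counts listed under Model Specification.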

Variations

OpenSpliceAI ships trained model families for human MANE and four non-human species. Each family provides four flanking-context sizes. MultiMolecule publishes one seed (rs10) for each family/context pair; the other upstream seed checkpoints are training replicates and are not exposed as separate model variants.

Family	80 nt	400 nt	2,000 nt	10,000 nt
MANE / human	`openspliceai-mane-80nt`	`openspliceai-mane-400nt`	`openspliceai-mane-2000nt`	`openspliceai-mane-10000nt`
Mouse	`openspliceai-mouse-80nt`	`openspliceai-mouse-400nt`	`openspliceai-mouse-2000nt`	`openspliceai-mouse-10000nt`
Zebrafish	`openspliceai-zebrafish-80nt`	`openspliceai-zebrafish-400nt`	`openspliceai-zebrafish-2000nt`	`openspliceai-zebrafish-10000nt`
Honeybee	`openspliceai-honeybee-80nt`	`openspliceai-honeybee-400nt`	`openspliceai-honeybee-2000nt`	`openspliceai-honeybee-10000nt`
Arabidopsis	`openspliceai-arabidopsis-80nt`	`openspliceai-arabidopsis-400nt`	`openspliceai-arabidopsis-2000nt`	`openspliceai-arabidopsis-10000nt`
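The variant names in the table follow a regular pattern, so a checkpoint identifier can be assembled from a family and a context size. The helper below is a hypothetical convenience function, not part of the multimolecule library, and the `multimolecule/` prefix is assumed to apply to every variant because it appears in the usage example for the honeybee 80 nt model.

```python
# Families and context sizes as listed in the Variations table.
FAMILIES = ("mane", "mouse", "zebrafish", "honeybee", "arabidopsis")
CONTEXTS = (80, 400, 2000, 10000)


def checkpoint_id(family: str, context: int) -> str:
    """Build a checkpoint identifier (hypothetical helper, assumed naming scheme)."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family!r}")
    if context not in CONTEXTS:
        raise ValueError(f"unsupported flanking context: {context}")
    return f"multimolecule/openspliceai-{family}-{context}nt"


print(checkpoint_id("honeybee", 80))
```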

Model Specification

Flanking Context	Residual Blocks	Hidden Size	Num Parameters (M)	FLOPs (G)	MACs (G)
80 nt	4	32	0.09	0.95	0.47
400 nt	8	32	0.19	2.00	0.99
2,000 nt	12	32	0.36	5.03	2.50
10,000 nt	16	32	0.70	20.90	10.40

Model size is determined by flanking context and is shared across species for the same context. FLOPs and MACs are reported for a single 5,000-nucleotide output sequence.

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

RNA Splicing Site Prediction

You can use this model directly to predict the splice sites of a pre-mRNA sequence:

>>> from multimolecule import DnaTokenizer, OpenSpliceAiForTokenPrediction

>>> model_id = "multimolecule/openspliceai-honeybee-80nt"
>>> tokenizer = DnaTokenizer.from_pretrained(model_id)
>>> model = OpenSpliceAiForTokenPrediction.from_pretrained(model_id)
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['logits'])

Each output position carries three logits corresponding to neither, acceptor, and donor.
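As a minimal sketch of how those three logits might be turned into a per-position call, the code below applies a softmax and picks the most probable class. It uses plain Python lists and made-up logit values so that it runs without the model; the class order (neither, acceptor, donor) is the one stated above, and everything else is illustrative.

```python
import math

CLASS_NAMES = ("neither", "acceptor", "donor")  # order as stated in this model card


def softmax(logits):
    """Convert one position's logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def call_sites(per_position_logits):
    """Return (class_name, probability) for the most likely class at each position."""
    calls = []
    for logits in per_position_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        calls.append((CLASS_NAMES[best], probs[best]))
    return calls


# Made-up logits for three positions, for illustration only.
example_logits = [[4.0, -1.0, -2.0], [-1.5, 3.0, 0.0], [0.2, -0.3, 2.5]]
for position, (name, prob) in enumerate(call_sites(example_logits)):
    print(position, name, round(prob, 3))
```

On the tensor returned by the model, the equivalent step would presumably be a softmax over the last dimension of `output["logits"]`, assuming the class axis comes last.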

Training Details

OpenSpliceAI was trained to predict the location of splice donor and acceptor sites from primary DNA sequence, following the SpliceAI training methodology.

Training Data

The MANE variants were trained on transcripts from the GENCODE/MANE human reference annotation. The non-human variants use the species annotations released by OpenSpliceAI for mouse, zebrafish, honeybee, and Arabidopsis. For each predicted nucleotide, the model receives a flanking context of 80, 400, 2,000, or 10,000 nucleotides, split evenly across the two sides of the output sequence, with sequence ends padded with N. Annotated splice donor and acceptor sites serve as positive labels; all other positions are negative.
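The padding scheme described above can be sketched as follows: half of the flanking context is added as N on each side of the sequence to be scored. The function name is hypothetical, and whether the model wrapper applies this padding internally is not stated here, so this only illustrates the length arithmetic.

```python
def pad_with_context(sequence: str, context: int) -> str:
    """Pad `sequence` with N so that every position has `context // 2` flanking bases per side."""
    if context < 0 or context % 2:
        raise ValueError("context must be a non-negative even number")
    flank = "N" * (context // 2)
    return flank + sequence + flank


padded = pad_with_context("AGCAGTCATTATGGCGAA", 80)
print(len(padded))  # 18 scored positions + 80 context bases

# A 5,000-nucleotide output window with the largest context.
print(len(pad_with_context("A" * 5000, 10000)))
```

With an 18-nucleotide sequence and an 80-nucleotide context, the padded input is 98 bases long; a 5,000-nucleotide output window with a 10,000-nucleotide context gives 15,000 bases.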

Training Procedure

Pre-training

The model was trained to minimize a cross-entropy loss between predicted splice-site probabilities and the reference annotation.

Architecture: dilated residual 1D convolutions with skip connections
Activation: LeakyReLU (slope 0.1)
Optimizer: Adam
Loss: cross-entropy
Flanking context sizes: 80 / 400 / 2,000 / 10,000 nucleotides
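The cross-entropy objective listed above can be illustrated for the three-class case. The sketch below computes the loss for one position from raw logits and averages it over positions; the mean reduction and the label encoding (0 = neither, 1 = acceptor, 2 = donor) are assumptions made for illustration, not details confirmed by this model card.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    peak = max(logits)  # log-sum-exp with the maximum subtracted for stability
    log_norm = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_norm - logits[label]


def mean_cross_entropy(batch_logits, labels):
    """Average the per-position loss (assumed reduction)."""
    losses = [cross_entropy(logits, label) for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Made-up logits and labels: 0 = neither, 1 = acceptor, 2 = donor (assumed encoding).
batch_logits = [[4.0, -1.0, -2.0], [-1.5, 3.0, 0.0], [0.2, -0.3, 2.5]]
labels = [0, 1, 2]
print(mean_cross_entropy(batch_logits, labels))
```

A position with uniform logits contributes a loss of log 3, the value expected from an uninformative three-class prediction.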

Please refer to the OpenSpliceAI paper for the full training protocol and hardware details.

Citation

@article{chao2025openspliceai,
  author    = {Chao, Kuan-Hao and Mao, Alan and Liu, Anqi and Salzberg, Steven L and Pertea, Mihaela},
  title     = {OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species},
  journal   = {eLife},
  volume    = 14,
  pages     = {RP107454},
  year      = 2025,
  doi       = {10.7554/eLife.107454.3},
  publisher = {eLife Sciences Publications, Ltd}
}

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use the MultiMolecule GitHub issues for any questions or comments on the model card.

Please contact the authors of the OpenSpliceAI paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later
