---
language: dna
tags:
- Biology
- DNA
- Genomics
license: agpl-3.0
datasets:
- multimolecule/gencode
library_name: multimolecule
---
# OpenSpliceAI
Modular native-PyTorch reimplementation of SpliceAI for predicting pre-mRNA splice sites from primary DNA sequence.
## Disclaimer
This is an UNOFFICIAL implementation of [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3) by Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg and Mihaela Pertea.
The OFFICIAL repository of OpenSpliceAI is at [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.
**The team releasing OpenSpliceAI did not write a model card for this model, so this model card has been written by the MultiMolecule team.**
## Model Details
OpenSpliceAI is a deep dilated residual convolutional neural network that reimplements the SpliceAI architecture in native PyTorch. It predicts, for each nucleotide of a pre-mRNA transcript, whether the position is a splice acceptor, a splice donor, or neither. The model stacks dilated residual units with increasing kernel size and atrous rate so that a wide genomic context window contributes to each per-nucleotide prediction, while skip connections aggregate multi-scale features. OpenSpliceAI reproduces the predictive behavior of SpliceAI while providing an efficient, modular training pipeline that can be retrained on non-human species.
### Variations
OpenSpliceAI ships trained model families for human MANE and four non-human species. Each family provides four
flanking-context sizes. MultiMolecule publishes one seed (`rs10`) for each family/context pair; the other upstream seed
checkpoints are training replicates and are not exposed as separate model variants.
| Family | 80 nt | 400 nt | 2,000 nt | 10,000 nt |
| ------------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| MANE / human | [`openspliceai-mane-80nt`](https://huggingface.co/multimolecule/openspliceai-mane-80nt) | [`openspliceai-mane-400nt`](https://huggingface.co/multimolecule/openspliceai-mane-400nt) | [`openspliceai-mane-2000nt`](https://huggingface.co/multimolecule/openspliceai-mane-2000nt) | [`openspliceai-mane-10000nt`](https://huggingface.co/multimolecule/openspliceai-mane-10000nt) |
| Mouse | [`openspliceai-mouse-80nt`](https://huggingface.co/multimolecule/openspliceai-mouse-80nt) | [`openspliceai-mouse-400nt`](https://huggingface.co/multimolecule/openspliceai-mouse-400nt) | [`openspliceai-mouse-2000nt`](https://huggingface.co/multimolecule/openspliceai-mouse-2000nt) | [`openspliceai-mouse-10000nt`](https://huggingface.co/multimolecule/openspliceai-mouse-10000nt) |
| Zebrafish | [`openspliceai-zebrafish-80nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-80nt) | [`openspliceai-zebrafish-400nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-400nt) | [`openspliceai-zebrafish-2000nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-2000nt) | [`openspliceai-zebrafish-10000nt`](https://huggingface.co/multimolecule/openspliceai-zebrafish-10000nt) |
| Honeybee | [`openspliceai-honeybee-80nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-80nt) | [`openspliceai-honeybee-400nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-400nt) | [`openspliceai-honeybee-2000nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-2000nt) | [`openspliceai-honeybee-10000nt`](https://huggingface.co/multimolecule/openspliceai-honeybee-10000nt) |
| _Arabidopsis_ | [`openspliceai-arabidopsis-80nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-80nt) | [`openspliceai-arabidopsis-400nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-400nt) | [`openspliceai-arabidopsis-2000nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-2000nt) | [`openspliceai-arabidopsis-10000nt`](https://huggingface.co/multimolecule/openspliceai-arabidopsis-10000nt) |
### Model Specification
| Flanking Context | Residual Blocks | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ---------------- | --------------- | ----------- | ------------------ | --------- | -------- |
| 80 nt | 4 | 32 | 0.09 | 0.95 | 0.47 |
| 400 nt | 8 | 32 | 0.19 | 2.00 | 0.99 |
| 2,000 nt | 12 | 32 | 0.36 | 5.03 | 2.50 |
| 10,000 nt | 16 | 32 | 0.70 | 20.90 | 10.40 |
Model size is determined by flanking context and is shared across species for the same context. FLOPs and MACs are
reported for a single 5,000-nucleotide output sequence.
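
The block counts in the table can be related to the flanking context with a short receptive-field calculation. The sketch below assumes the kernel sizes (11, 11, 21, 41) and dilation rates (1, 4, 10, 25) of the original SpliceAI design, applied in groups of four residual blocks with two convolutions per block. These hyperparameters are not stated in this model card, and `flanking_context` is a hypothetical helper, so treat the result as a consistency check rather than a description of the shipped configuration.

```python
# Assumed hyperparameters (original SpliceAI design, not taken from this card):
# (kernel_size, dilation) for each group of four residual blocks.
GROUPS = [(11, 1), (11, 4), (21, 10), (41, 25)]
BLOCKS_PER_GROUP = 4
CONVS_PER_BLOCK = 2


def flanking_context(num_blocks: int) -> int:
    """Return the total context (both sides) covered by `num_blocks` residual blocks."""
    max_blocks = BLOCKS_PER_GROUP * len(GROUPS)
    if num_blocks % BLOCKS_PER_GROUP or not 0 < num_blocks <= max_blocks:
        raise ValueError(f"num_blocks must be a multiple of 4 between 4 and {max_blocks}")
    context = 0
    for kernel_size, dilation in GROUPS[: num_blocks // BLOCKS_PER_GROUP]:
        # One convolution with kernel k and dilation d widens the
        # receptive field by (k - 1) * d positions.
        context += BLOCKS_PER_GROUP * CONVS_PER_BLOCK * (kernel_size - 1) * dilation
    return context


for blocks in (4, 8, 12, 16):
    print(blocks, flanking_context(blocks))
```

Under these assumptions, 4, 8, 12, and 16 residual blocks cover 80, 400, 2,000, and 10,000 nucleotides of context, matching the rows of the table.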
### Links
- **Code**: [multimolecule.openspliceai](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/openspliceai)
- **Weights**: See the 20 variant repositories listed above.
- **Data**: Human MANE/GENCODE for the MANE variants; species annotations follow the original OpenSpliceAI release.
- **Paper**: [OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species](https://doi.org/10.7554/eLife.107454.3)
- **Developed by**: Kuan-Hao Chao, Alan Mao, Anqi Liu, Steven L. Salzberg, Mihaela Pertea
- **Original Repository**: [Kuanhao-Chao/OpenSpliceAI](https://github.com/Kuanhao-Chao/OpenSpliceAI) (licensed GPL-3.0)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### RNA Splicing Site Prediction
You can use this model directly to predict the splice sites of a pre-mRNA sequence:
```python
>>> from multimolecule import DnaTokenizer, OpenSpliceAiForTokenPrediction
>>> model_id = "multimolecule/openspliceai-mouse-2000nt"
>>> tokenizer = DnaTokenizer.from_pretrained(model_id)
>>> model = OpenSpliceAiForTokenPrediction.from_pretrained(model_id)
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])
>>> output.keys()
odict_keys(['logits'])
```
Each output position carries three logits corresponding to _neither_, _acceptor_, and _donor_.
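
As a minimal sketch of how these logits might be turned into splice-site calls, the example below applies a softmax per position and keeps positions whose most probable class is _acceptor_ or _donor_. It uses plain Python lists so it runs without the model. The helper `call_sites`, the 0.5 threshold, and the example logits are hypothetical, and the `(batch, length, 3)` layout mentioned in the comment is an assumption rather than something this card specifies.

```python
import math

# Class order as stated in this model card.
LABELS = ("neither", "acceptor", "donor")


def softmax(logits):
    """Numerically stable softmax over the logits of one position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_sites(position_logits, threshold=0.5):
    """Return (index, label, probability) for positions called as splice sites."""
    sites = []
    for index, logits in enumerate(position_logits):
        probs = softmax(logits)
        best = max(range(len(LABELS)), key=probs.__getitem__)
        # The threshold is an illustrative choice, not a recommended cutoff.
        if LABELS[best] != "neither" and probs[best] >= threshold:
            sites.append((index, LABELS[best], probs[best]))
    return sites


# Hypothetical logits for four positions. With a real model these might come
# from output["logits"][0].tolist(), assuming a (batch, length, 3) layout.
example = [[4.0, -2.0, -3.0], [-1.0, 3.0, -2.0], [2.5, 0.0, -1.0], [-2.0, -1.0, 4.0]]
for index, label, prob in call_sites(example):
    print(index, label, round(prob, 3))
```

For these example logits, position 1 is called as an acceptor and position 3 as a donor; the other two positions are classified as _neither_.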
## Training Details
OpenSpliceAI was trained to predict the location of splice donor and acceptor sites from primary DNA sequence, following the SpliceAI training methodology.
### Training Data
The MANE variants were trained on transcripts from the [GENCODE](https://multimolecule.danling.org/datasets/gencode)/MANE human reference annotation. The non-human variants use the species annotations released by OpenSpliceAI for mouse, zebrafish, honeybee, and _Arabidopsis_. For each predicted nucleotide, the model receives a flanking context of 80, 400, 2,000, or 10,000 nucleotides, split evenly across the two sides of the output sequence, with sequence ends padded with `N`. Annotated splice donor and acceptor sites serve as positive labels; all other positions are negative.
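
The padding described above can be sketched as follows for a sequence that has no real genomic flank available. `pad_with_flank` is a hypothetical helper, and whether the MultiMolecule model applies equivalent padding internally is not stated in this card, so this is an illustration of the scheme rather than a required preprocessing step.

```python
def pad_with_flank(sequence: str, flanking_context: int) -> str:
    """Pad `sequence` with N so that each side gains flanking_context // 2 bases."""
    if flanking_context < 0 or flanking_context % 2:
        raise ValueError("flanking_context must be a non-negative even number")
    half = flanking_context // 2  # the context is split evenly across both sides
    return "N" * half + sequence + "N" * half


padded = pad_with_flank("AGCAGTCATTATGGCGAA", 80)
print(len(padded))  # 18 output positions + 80 context positions = 98
```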
### Training Procedure
#### Pre-training
The model was trained to minimize a cross-entropy loss between predicted splice-site probabilities and the reference annotation.
- Architecture: dilated residual 1D convolutions with skip connections
- Activation: LeakyReLU (slope 0.1)
- Optimizer: Adam
- Loss: cross-entropy
- Flanking context sizes: 80 / 400 / 2,000 / 10,000 nucleotides
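
To make the activation and loss listed above concrete, here is a small dependency-free sketch. It assumes class indices 0, 1, and 2 for _neither_, _acceptor_, and _donor_, and a plain mean over positions. The reduction and any class weighting used in the actual training run are not described in this card, and both function names are illustrative.

```python
import math


def leaky_relu(x: float, slope: float = 0.1) -> float:
    """LeakyReLU with the negative slope listed above."""
    return x if x >= 0 else slope * x


def cross_entropy(position_logits, targets) -> float:
    """Mean cross-entropy over positions; `targets` holds class indices."""
    total = 0.0
    for logits, target in zip(position_logits, targets):
        # log-sum-exp, shifted by the maximum for numerical stability
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(targets)


# Uniform logits give a loss of log(3), the value expected from an untrained
# three-class predictor.
print(cross_entropy([[0.0, 0.0, 0.0]], [1]))
print(leaky_relu(-2.0))
```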
Please refer to the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for the full training protocol and hardware details.
## Citation
```bibtex
@article{chao2025openspliceai,
author = {Chao, Kuan-Hao and Mao, Alan and Liu, Anqi and Salzberg, Steven L and Pertea, Mihaela},
title = {OpenSpliceAI: An efficient, modular implementation of SpliceAI enabling easy retraining on non-human species},
journal = {eLife},
volume = 14,
pages = {RP107454},
year = 2025,
doi = {10.7554/eLife.107454.3},
publisher = {eLife Sciences Publications, Ltd}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [OpenSpliceAI paper](https://doi.org/10.7554/eLife.107454.3) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```