---
language: dna
tags:
  - Biology
  - DNA
  - Splicing
license: agpl-3.0
datasets:
  - multimolecule/gencode
library_name: multimolecule
---

# Pangolin

Convolutional neural network for predicting tissue-specific splice site strength from pre-mRNA sequences.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting RNA splicing from DNA sequence using Pangolin](https://doi.org/10.1186/s13059-022-02664-4) by Tony Zeng and Yang I. Li.

The OFFICIAL repository of Pangolin is at [tkzeng/Pangolin](https://github.com/tkzeng/Pangolin).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team releasing Pangolin did not write a model card for this model, so this model card has been written by the MultiMolecule team.**

## Model Details

Pangolin is a deep convolutional neural network (CNN) that predicts splice site strength from primary pre-mRNA sequence.
It extends the dilated-residual SpliceAI architecture to predict tissue-specific splice site usage, and is trained on splicing measurements derived from RNA-seq data across multiple tissues.
The network processes a one-hot encoded nucleotide sequence and, for each position, predicts a splice-site score and a splice-site usage score per tissue.
Pangolin is typically used to estimate the effect of genetic variants on splicing by scoring reference and alternate sequences and taking the difference.
Please refer to the [Training Details](#training-details) section for more information on the training process.
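
The reference-versus-alternate comparison mentioned above can be sketched as follows. This is a minimal illustration, not the scoring code from the official repository: `variant_effect` is a hypothetical helper, and the two lists are assumed to hold per-position scores for one tissue that have already been aligned to the same positions.

```python
def variant_effect(ref_scores, alt_scores):
    """Return (largest_gain, largest_loss) between aligned per-position scores.

    Hypothetical helper for illustration; both inputs are assumed to be
    non-empty and aligned position by position.
    """
    if not ref_scores or len(ref_scores) != len(alt_scores):
        raise ValueError("scores must be non-empty and aligned")
    # Positive delta: the alternate allele strengthens the site.
    # Negative delta: the alternate allele weakens it.
    deltas = [alt - ref for ref, alt in zip(ref_scores, alt_scores)]
    largest_gain = max(max(deltas), 0.0)
    largest_loss = min(min(deltas), 0.0)
    return largest_gain, largest_loss


ref = [0.10, 0.80, 0.05]  # made-up reference scores
alt = [0.10, 0.30, 0.40]  # made-up alternate scores
gain, loss = variant_effect(ref, alt)
```

Reporting the largest gain and the largest loss separately keeps a weakened site from cancelling out a newly created one.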

The official release distributes tissue-specific ensembles. MultiMolecule exposes the canonical v2 model as a single checkpoint that bundles the three replicate networks for each of the four tissue groups (heart, liver, brain, and testis), 12 networks in total. Ensemble membership is an implementation detail and is not exposed in the downstream API.

### Model Specification

| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ---------- | ----------- | ------------------ | --------- | -------- |
| 16         | 32          | 8.36               | 168.85    | 84.04    |

- Num Layers / Hidden Size describe a single ensemble member; Num Parameters / FLOPs / MACs are for the full 12-member canonical ensemble.
- FLOPs and MACs are measured on a 100-nucleotide input.

### Links

- **Code**: [multimolecule.pangolin](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/pangolin)
- **Weights**: [multimolecule/pangolin](https://huggingface.co/multimolecule/pangolin)
- **Paper**: [Predicting RNA splicing from DNA sequence using Pangolin](https://doi.org/10.1186/s13059-022-02664-4)
- **Developed by**: Tony Zeng, Yang I. Li
- **Original Repository**: [tkzeng/Pangolin](https://github.com/tkzeng/Pangolin)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### RNA Splice Site Prediction

You can use this model directly to predict per-nucleotide, tissue-specific splice-site scores and usage for a pre-mRNA sequence:

```python
>>> from multimolecule import DnaTokenizer, PangolinModel

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/pangolin")
>>> model = PangolinModel.from_pretrained("multimolecule/pangolin")
>>> output = model(tokenizer("AGCAGTCATTATGGCGAA", return_tensors="pt")["input_ids"])

>>> output.keys()
odict_keys(['last_hidden_state', 'logits'])
```

The `logits` tensor reproduces the original Pangolin output: for each of the four tissues, two splice-site score channels and one splice-site usage channel.
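
As a rough sketch of how the channels for one position could be grouped by tissue: the code below assumes that the last dimension of `logits` holds 12 values laid out tissue by tissue, with the two score channels before the usage channel. Both the tissue order and the channel order are assumptions made for illustration, and `split_by_tissue` is a hypothetical helper, so check `output.logits.shape` and the library source before relying on this layout.

```python
TISSUES = ("heart", "liver", "brain", "testis")  # order is an assumption
CHANNELS_PER_TISSUE = 3  # two splice-site score channels + one usage channel


def split_by_tissue(position_logits):
    """Group the values for one nucleotide position into per-tissue entries."""
    expected = len(TISSUES) * CHANNELS_PER_TISSUE
    if len(position_logits) != expected:
        raise ValueError(f"expected {expected} channels, got {len(position_logits)}")
    grouped = {}
    for index, tissue in enumerate(TISSUES):
        start = index * CHANNELS_PER_TISSUE
        # Assumed layout inside a tissue block: score, score, usage.
        score_a, score_b, usage = position_logits[start:start + CHANNELS_PER_TISSUE]
        grouped[tissue] = {"scores": (score_a, score_b), "usage": usage}
    return grouped


# Placeholder values standing in for one row of the logits tensor.
grouped = split_by_tissue(list(range(12)))
```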

### Downstream Use

#### Token Prediction

You can fine-tune Pangolin for per-nucleotide splice site strength regression with [`PangolinForTokenPrediction`][multimolecule.models.PangolinForTokenPrediction], which adds a shared token prediction head on top of the backbone.

## Training Details

Pangolin was trained to predict tissue-specific splice site usage from primary pre-mRNA sequence.

### Training Data

Pangolin was trained on splice site usage derived from RNA-seq data in heart, liver, brain, and testis tissues from human and three other species, using gene annotations from [GENCODE](https://multimolecule.danling.org/datasets/gencode).
For each nucleotide whose splicing status was predicted, a sequence window centered on that nucleotide was used, with the flanking context padded with `N` (unknown nucleotide) when near transcript ends.
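
The windowing with `N` padding described above can be illustrated with a short sketch. `centered_window` and the `flank` values used here are hypothetical; the actual context length used to train Pangolin is not specified in this card.

```python
def centered_window(sequence, center, flank):
    """Return `flank` bases on each side of `center`, padding with 'N' past the ends."""
    if not 0 <= center < len(sequence):
        raise IndexError("center must index into sequence")
    start, end = center - flank, center + flank + 1
    left_pad = "N" * max(0, -start)                 # bases missing before the sequence start
    right_pad = "N" * max(0, end - len(sequence))   # bases missing after the sequence end
    return left_pad + sequence[max(0, start):min(len(sequence), end)] + right_pad


near_start = centered_window("ACGTACGT", center=1, flank=3)
near_end = centered_window("ACGTACGT", center=7, flank=2)
```

Every window has length `2 * flank + 1`, so the nucleotide being predicted always sits at index `flank` regardless of how much padding was needed.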

### Training Procedure

#### Pre-training

The model was trained to minimize a combination of cross-entropy loss over splice-site classification and a regression loss over splice-site usage, comparing predictions against measurements derived from RNA-seq.

- Optimizer: AdamW
- Learning rate scheduler: Step decay
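
The combined objective described above can be sketched for a single position as follows. The exact regression loss and the relative weighting are not given in this card, so the squared error and the `usage_weight` parameter are stand-in assumptions, and `combined_loss` is a hypothetical name rather than a function from the original training code.

```python
import math


def cross_entropy(class_probs, target_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(class_probs[target_index])


def combined_loss(class_probs, class_target, usage_pred, usage_target, usage_weight=1.0):
    """Classification loss plus a weighted usage loss for one position."""
    classification = cross_entropy(class_probs, class_target)
    # Squared error is an assumed stand-in for the usage regression loss.
    regression = (usage_pred - usage_target) ** 2
    return classification + usage_weight * regression


# Made-up values: the true class is index 1, measured usage is 1.0.
loss_value = combined_loss([0.25, 0.75], 1, usage_pred=0.5, usage_target=1.0)
```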

## Citation

```bibtex
@article{zeng2022predicting,
  author    = {Zeng, Tony and Li, Yang I.},
  title     = {Predicting RNA splicing from DNA sequence using Pangolin},
  journal   = {Genome Biology},
  volume    = {23},
  number    = {1},
  pages     = {103},
  year      = {2022},
  doi       = {10.1186/s13059-022-02664-4},
  publisher = {BioMed Central}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Pangolin paper](https://doi.org/10.1186/s13059-022-02664-4) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```