---
language: dna
tags:
- Biology
- DNA
- RNA
license: agpl-3.0
library_name: multimolecule
---
# APARENT
Convolutional neural network for predicting human 3'UTR Alternative Polyadenylation (APA) from sequence.
## Disclaimer
This is an UNOFFICIAL implementation of [A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation](https://doi.org/10.1016/j.cell.2019.04.046) by Nicholas Bogard, Johannes Linder et al.
The OFFICIAL repository of APARENT is at [johli/aparent](https://github.com/johli/aparent).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
**The team releasing APARENT did not write a model card for this model, so this model card has been written by the MultiMolecule team.**
## Model Details
APARENT (APA REgression NeT) is a convolutional neural network trained on more than 3.5 million randomized 3'UTR poly-A signals expressed on mini-gene reporters in HEK293 cells. Given a fixed-length 205 nt 3'UTR/polyA sequence, APARENT predicts the alternative-polyadenylation isoform proportion (a scalar) and a positional cleavage distribution. The model is primarily used to score the impact of genetic variants on APA regulation and to engineer new polyadenylation signals. Please refer to the [Training Details](#training-details) section for more information on the training process.
This MultiMolecule port converts the base, non-normalised checkpoint (`aparent_large_lessdropout_all_libs_no_sampleweights.h5`) that the original authors recommend for isoform and variant-effect prediction.
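As an illustration of variant scoring, one option is to compare the predicted isoform proportions of a reference and a variant sequence on the log-odds scale. This card does not say which metric to use, so the log odds ratio below is an assumption, and `delta_log_odds` and the example proportions are hypothetical placeholders rather than model outputs.

```python
import math


def logit(p: float, eps: float = 1e-6) -> float:
    """Log odds of a proportion, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def delta_log_odds(p_ref: float, p_var: float) -> float:
    """Log odds ratio of the variant vs. reference isoform proportion.

    Hypothetical helper: negative values mean the variant lowers the
    predicted isoform proportion, positive values mean it raises it.
    """
    return logit(p_var) - logit(p_ref)


# Placeholder proportions, not real predictions.
reference_proportion = 0.80
variant_proportion = 0.50
effect = delta_log_odds(reference_proportion, variant_proportion)
print(f"delta log odds: {effect:.3f}")
```

Working on the log-odds scale keeps effects comparable across sequences whose baseline proportions sit near 0 or 1, where raw differences are compressed.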
### Architecture
- Input: fixed-length 205 nt one-hot sequence.
- `Conv1d` (96 filters, kernel 8) + ReLU, spanning the full nucleotide dimension.
- `MaxPool1d` (window 2).
- `Conv1d` (128 filters, kernel 6) + ReLU.
- Flatten (length-major, channel-minor) concatenated with the upstream distal-PAS scalar.
- `Linear` (512) + ReLU + Dropout.
- `Linear` (256) + ReLU + Dropout — the shared sequence representation (`pooler_output`).
- Two output layers consuming the shared representation concatenated with the upstream library one-hot:
  - isoform proportion: `Linear` (1), sigmoid.
  - cleavage distribution: `Linear` (206), softmax.
The MultiMolecule `AparentForSequencePrediction` exposes the upstream sequence-level APA isoform score. The upstream positional cleavage distribution remains available on `AparentModel` as `cleavage_logits`. The upstream library one-hot and distal-PAS scalar are rebuilt as deterministic constants matching the upstream default encoder.
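The layer list above can be sanity-checked with some shape arithmetic. The sketch below assumes unpadded, stride-1 convolutions, a 4-channel one-hot input, and a 13-way library one-hot; none of these is stated in this card, so treat them as assumptions. The helper names are hypothetical.

```python
def conv_out_len(length: int, kernel: int) -> int:
    """Output length of an unpadded, stride-1 1D convolution (assumed)."""
    return length - kernel + 1


def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights plus biases of a 1D convolution."""
    return c_in * kernel * c_out + c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


SEQ_LEN = 205          # fixed input length from the card
NUM_NUCLEOTIDES = 4    # assumption: A, C, G, T one-hot channels
NUM_LIBRARIES = 13     # assumption: size of the library one-hot

len_conv1 = conv_out_len(SEQ_LEN, 8)    # after Conv1d (96 filters, kernel 8)
len_pool = len_conv1 // 2               # after MaxPool1d (window 2)
len_conv2 = conv_out_len(len_pool, 6)   # after Conv1d (128 filters, kernel 6)
flat = len_conv2 * 128 + 1              # flatten + distal-PAS scalar
head_in = 256 + NUM_LIBRARIES           # shared representation + library one-hot

total = (
    conv_params(NUM_NUCLEOTIDES, 96, 8)
    + conv_params(96, 128, 6)
    + linear_params(flat, 512)
    + linear_params(512, 256)
    + linear_params(head_in, 1)      # isoform head
    + linear_params(head_in, 206)    # cleavage head
)
print(f"{total / 1e6:.2f} M parameters")
```

Under these assumptions the total comes to roughly 6.43 M, in line with the Model Specification table. Almost all of it sits in the first `Linear` layer that consumes the flattened convolutional features.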
### Model Specification
| Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ---------- | ----------- | ------------------ | --------- | -------- | -------------- |
| 4 | 256 | 6.43 | 0.03 | 0.01 | 205 |
### Links
- **Code**: [multimolecule.aparent](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/aparent)
- **Weights**: [multimolecule/aparent](https://huggingface.co/multimolecule/aparent)
- **Paper**: [A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation](https://doi.org/10.1016/j.cell.2019.04.046)
- **Developed by**: Nicholas Bogard, Johannes Linder, Alexander B. Rosenberg, Georg Seelig
- **Original Repository**: [johli/aparent](https://github.com/johli/aparent)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### APA Isoform Prediction
You can use this model directly to predict the APA isoform proportion of a 3'UTR/polyA sequence:
```python
>>> from multimolecule import DnaTokenizer, AparentForSequencePrediction
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/aparent")
>>> model = AparentForSequencePrediction.from_pretrained("multimolecule/aparent")
>>> output = model(**tokenizer("ACGTACGTACGT", return_tensors="pt"))
>>> output.keys()
odict_keys(['logits'])
```
The full upstream isoform and cleavage outputs are available on the backbone:
```python
>>> from multimolecule import DnaTokenizer, AparentModel
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/aparent")
>>> model = AparentModel.from_pretrained("multimolecule/aparent")
>>> output = model(**tokenizer("ACGTACGTACGT", return_tensors="pt"))
>>> output.keys()
odict_keys(['pooler_output', 'isoform_logits', 'cleavage_logits'])
```
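The Architecture section lists a sigmoid on the isoform head and a softmax on the cleavage head. Assuming `isoform_logits` and `cleavage_logits` are the pre-activation values (the names suggest this, but the card does not confirm it), they could be mapped to a proportion and a distribution as sketched below. The input values are placeholders, not model outputs.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: list[float]) -> list[float]:
    """Map a list of logits to a distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# Placeholder logits standing in for the model outputs (assumption).
isoform_logit = 0.0
cleavage_logits = [0.0] * 206

isoform_proportion = sigmoid(isoform_logit)
cleavage_distribution = softmax(cleavage_logits)
print(isoform_proportion, sum(cleavage_distribution))
```

With PyTorch tensors, `torch.sigmoid` and `torch.softmax(..., dim=-1)` would play the same roles.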
## Training Details
APARENT was trained to jointly predict the APA isoform proportion and the positional cleavage distribution of randomized 3'UTR poly-A signals.
### Training Data
APARENT was trained on more than 3.5 million randomized 3'UTR poly-A signal sequences expressed on mini-gene reporters in HEK293 cells (a massively parallel reporter assay, MPRA). The raw sequencing data for the 3'UTR MPRA libraries are available at GEO accession GSE113849.
The converted checkpoint (`aparent_large_lessdropout_all_libs_no_sampleweights.h5`) was trained on all MPRA libraries (no libraries held out) to produce the best general-purpose APA predictor; it differs from the per-library held-out model evaluated in the paper.
### Training Procedure
#### Pre-training
The model was trained to minimize a combined objective: a sigmoid KL-divergence on the isoform proportion and a KL-divergence on the positional cleavage distribution, weighted equally.
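A minimal sketch of such an objective follows. It reads "sigmoid KL-divergence" as the KL divergence between two Bernoulli distributions (observed vs. predicted isoform proportion), which is an interpretation rather than something this card spells out. The function names and the clipping constant are hypothetical.

```python
import math

EPS = 1e-6  # assumption: small constant to keep logarithms finite


def _clip(p: float) -> float:
    return min(max(p, EPS), 1.0 - EPS)


def bernoulli_kl(target: float, pred: float) -> float:
    """KL(target || pred) for two Bernoulli proportions."""
    t, p = _clip(target), _clip(pred)
    return t * math.log(t / p) + (1.0 - t) * math.log((1.0 - t) / (1.0 - p))


def categorical_kl(target: list[float], pred: list[float]) -> float:
    """KL(target || pred) over cleavage positions; zero-mass targets are skipped."""
    return sum(t * math.log(t / _clip(q)) for t, q in zip(target, pred) if t > 0)


def combined_loss(iso_target, iso_pred, cut_target, cut_pred) -> float:
    """Equal-weight sum of the isoform and cleavage terms."""
    return bernoulli_kl(iso_target, iso_pred) + categorical_kl(cut_target, cut_pred)


# Toy example with a two-position cleavage distribution (placeholder values).
loss = combined_loss(0.5, 0.25, [0.5, 0.5], [0.5, 0.5])
print(f"{loss:.4f}")
```

Both terms vanish when the prediction matches the target, so the combined loss is zero only when the isoform proportion and the cleavage distribution are both reproduced.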
## Citation
```bibtex
@article{bogard2019adeep,
author = {Bogard, Nicholas and Linder, Johannes and Rosenberg, Alexander B. and Seelig, Georg},
title = {A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation},
journal = {Cell},
volume = {178},
number = {1},
pages = {91--106.e23},
year = {2019},
publisher = {Elsevier BV},
doi = {10.1016/j.cell.2019.04.046}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [APARENT paper](https://doi.org/10.1016/j.cell.2019.04.046) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```