| --- |
| language: dna |
| tags: |
| - Biology |
| - DNA |
| license: agpl-3.0 |
| library_name: multimolecule |
| --- |
| |
| # Xpresso |
|
|
| Deep convolutional neural network for predicting mRNA abundance directly from genomic promoter sequence. |
|
|
| ## Disclaimer |
|
|
| This is an UNOFFICIAL implementation of [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663) by Vikram Agarwal et al. |
|
|
| The OFFICIAL repository of Xpresso is at [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso). |
|
|
| > [!TIP] |
| > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. |
|
|
| **The team releasing Xpresso did not write this model card for this model so this model card has been written by the MultiMolecule team.** |
|
|
| ## Model Details |
|
|
| Xpresso is a deep convolutional neural network (CNN) that predicts steady-state mRNA expression level directly from genomic sequence. It consumes a promoter window of roughly 10.5 kb centered on the transcription start site (TSS), processes it through a stack of 1D convolution + max-pooling blocks, flattens the result, concatenates a small set of auxiliary numeric mRNA half-life features, and passes the combined representation through fully-connected layers to predict a single scalar expression value. Please refer to the [Training Details](#training-details) section for more information on the training process. |
|
|
| ### Model Specification |
|
|
| | Input Length | Conv Blocks | Hidden Size | Auxiliary Features | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens | |
| | ------------ | ----------- | ----------- | ------------------ | ------------------ | --------- | -------- | -------------- | |
| | 10,500 | 2 | 2 | 6 | 0.11 | 0.11 | 0.05 | 10,500 | |
|
|
| ### Links |
|
|
| - **Code**: [multimolecule.xpresso](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/xpresso) |
| - **Weights**: [multimolecule/xpresso](https://huggingface.co/multimolecule/xpresso) |
| - **Paper**: [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663) |
| - **Developed by**: Vikram Agarwal, Jay Shendure |
| - **Original Repository**: [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso) |
|
|
| ## Usage |
|
|
| The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: |
|
|
| ```bash |
| pip install multimolecule |
| ``` |
|
|
| ### Direct Use |
|
|
| #### mRNA Expression Prediction |
|
|
| You can use this model directly to predict the mRNA expression of a promoter sequence together with its auxiliary mRNA half-life features: |
|
|
| ```python |
| >>> import torch |
| >>> from multimolecule import DnaTokenizer, XpressoForSequencePrediction |
| |
| >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso") |
| >>> model = XpressoForSequencePrediction.from_pretrained("multimolecule/xpresso") |
| >>> input = tokenizer("ACGTACGTACGTACGT", return_tensors="pt") |
| >>> features = torch.randn(1, model.config.num_features) |
| >>> output = model(**input, features=features) |
| |
| >>> output.logits.shape |
| torch.Size([1, 1]) |
| ``` |
|
|
| The auxiliary half-life features are passed through the `features` argument as a float tensor of shape `(batch_size, num_features)`. Models configured with a non-zero `num_features` require this tensor; models configured with `num_features=0` do not accept it. |
|
|
| ## Training Details |
|
|
| Xpresso was trained to predict steady-state mRNA expression levels (median across tissues/cell lines) from genomic promoter sequence. |
|
|
| ### Training Data |
|
|
| Xpresso was trained on human and mouse genes, using promoter sequences (~10.5 kb windows centered on the TSS) together with mRNA half-life features derived from gene-body and UTR properties. Expression targets are log-transformed median mRNA levels across tissues. |
|
|
| ### Training Procedure |
|
|
| #### Pre-training |
|
|
| The model was trained to minimize a mean-squared-error loss between predicted and observed log mRNA expression values. |
|
|
| - Optimizer: Adam |
| - Loss: Mean squared error |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{agarwal2020predicting, |
| author = {Agarwal, Vikram and Shendure, Jay}, |
| journal = {Cell Reports}, |
| number = 7, |
| pages = {107663}, |
| publisher = {Elsevier BV}, |
| title = {Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks}, |
| volume = 31, |
| year = 2020 |
| } |
| ``` |
|
|
| > [!NOTE] |
| > The artifacts distributed in this repository are part of the MultiMolecule project. |
| > If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows: |
|
|
| ```bibtex |
| @software{chen_2024_12638419, |
| author = {Chen, Zhiyuan and Zhu, Sophia Y.}, |
| title = {MultiMolecule}, |
| doi = {10.5281/zenodo.12638419}, |
| publisher = {Zenodo}, |
| url = {https://doi.org/10.5281/zenodo.12638419}, |
| year = 2024, |
| month = may, |
| day = 4 |
| } |
| ``` |
|
|
| ## Contact |
|
|
| Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. |
|
|
| Please contact the authors of the [Xpresso paper](https://doi.org/10.1016/j.celrep.2020.107663) for questions or comments on the paper/model. |
|
|
| ## Known Limitations |
|
|
| - The released artifact ports the upstream `humanMedian` Keras weights; other upstream variants (`K562`, `GM12878`, `mESC`, `mouseMedian`) share the same architecture and can be converted with the same converter. |
| - Xpresso requires a fixed-length promoter window; shorter inputs are right-padded and longer inputs are center-cropped to `input_length`. |
|
|
| ## License |
|
|
| This model implementation is licensed under the [GNU Affero General Public License](license.md). |
|
|
| For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). |
|
|
| ```spdx |
| SPDX-License-Identifier: AGPL-3.0-or-later |
| ``` |
|
|