---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
library_name: multimolecule
---

# Xpresso

Deep convolutional neural network for predicting mRNA abundance directly from genomic promoter sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663) by Vikram Agarwal et al.

The OFFICIAL repository of Xpresso is at [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing Xpresso did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

Xpresso is a deep convolutional neural network (CNN) that predicts steady-state mRNA expression level directly from genomic sequence. It consumes a promoter window of roughly 10.5 kb centered on the transcription start site (TSS), processes it through a stack of 1D convolution + max-pooling blocks, flattens the result, concatenates a small set of auxiliary numeric mRNA half-life features, and passes the combined representation through fully-connected layers to predict a single scalar expression value. Please refer to the [Training Details](#training-details) section for more information on the training process.
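To make the data flow above concrete, the following minimal sketch computes the width of the vector that would reach the fully-connected layers. It assumes that each convolution preserves sequence length ("same" padding) and that each max-pooling layer uses non-overlapping windows; the channel counts and pool sizes in the usage lines are illustrative values chosen for this sketch, not values confirmed by this model card. The helper names `pooled_length` and `flattened_size` are hypothetical.

```python
def pooled_length(length, pool_size):
    """Length after non-overlapping max pooling (assumed stride == pool_size)."""
    return length // pool_size


def flattened_size(input_length, blocks, num_features):
    """Width of the vector fed to the fully-connected layers.

    `blocks` is a list of (out_channels, pool_size) pairs, one per
    convolution + max-pooling block. Convolutions are assumed to keep
    the sequence length unchanged ("same" padding).
    """
    length = input_length
    channels = 0
    for out_channels, pool_size in blocks:
        length = pooled_length(length, pool_size)
        channels = out_channels
    # Flatten (channels x length), then append the auxiliary features.
    return channels * length + num_features


# Illustrative hyperparameters (assumptions, not taken from this card).
example_blocks = [(128, 30), (32, 10)]
width = flattened_size(10_500, example_blocks, num_features=6)
print(width)
```

Under these assumed settings the 10,500-position window shrinks to 35 positions after two pooling steps, so the auxiliary features make up only a small part of the combined representation.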

### Model Specification

| Input Length | Conv Blocks | Hidden Size | Auxiliary Features | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ------------ | ----------- | ----------- | ------------------ | ------------------ | --------- | -------- | -------------- |
| 10,500       | 2           | 2           | 6                  | 0.11               | 0.11      | 0.05     | 10,500         |

### Links

- **Code**: [multimolecule.xpresso](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/xpresso)
- **Weights**: [multimolecule/xpresso](https://huggingface.co/multimolecule/xpresso)
- **Paper**: [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663)
- **Developed by**: Vikram Agarwal, Jay Shendure
- **Original Repository**: [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### mRNA Expression Prediction

You can use this model directly to predict the mRNA expression of a promoter sequence together with its auxiliary mRNA half-life features:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, XpressoForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
>>> model = XpressoForSequencePrediction.from_pretrained("multimolecule/xpresso")
>>> input = tokenizer("ACGTACGTACGTACGT", return_tensors="pt")
>>> features = torch.randn(1, model.config.num_features)

>>> output = model(**input, features=features)

>>> output.logits.shape
torch.Size([1, 1])
```

The auxiliary half-life features are passed through the `features` argument as a float tensor of shape `(batch_size, num_features)`. Models configured with a non-zero `num_features` require this tensor; models configured with `num_features=0` do not accept it.
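The shape rule for `features` can be checked before calling the model. The sketch below is a standalone illustration written with plain Python lists; `check_features` is a hypothetical helper and not part of the `multimolecule` API, and the error messages are invented for this example.

```python
def check_features(features, batch_size, num_features):
    """Validate auxiliary features against the rule described above.

    `features` is either None or a nested list shaped
    (batch_size, num_features). Hypothetical helper for illustration only.
    """
    if num_features == 0:
        # A model without auxiliary features does not accept any.
        if features is not None:
            raise ValueError("features were given but num_features is 0")
        return
    if features is None:
        raise ValueError(f"expected features of shape ({batch_size}, {num_features})")
    if len(features) != batch_size:
        raise ValueError(f"expected {batch_size} rows, got {len(features)}")
    for row in features:
        if len(row) != num_features:
            raise ValueError(f"expected {num_features} values per row, got {len(row)}")


# One sequence with six auxiliary values passes the check silently.
check_features([[0.0] * 6], batch_size=1, num_features=6)
```

With tensors, the equivalent check would compare `features.shape` against `(batch_size, model.config.num_features)`.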

## Training Details

Xpresso was trained to predict steady-state mRNA expression levels (median across tissues/cell lines) from genomic promoter sequence.

### Training Data

Xpresso was trained on human and mouse genes, using promoter sequences (~10.5 kb windows centered on the TSS) together with mRNA half-life features derived from gene-body and UTR properties. Expression targets are log-transformed median mRNA levels across tissues.

### Training Procedure

#### Pre-training

The model was trained to minimize a mean-squared-error loss between predicted and observed log mRNA expression values.

- Optimizer: Adam
- Loss: Mean squared error

## Citation

```bibtex
@article{agarwal2020predicting,
  author    = {Agarwal, Vikram and Shendure, Jay},
  journal   = {Cell Reports},
  number    = 7,
  pages     = {107663},
  publisher = {Elsevier BV},
  title     = {Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks},
  volume    = 31,
  year      = 2020
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Xpresso paper](https://doi.org/10.1016/j.celrep.2020.107663) for questions or comments on the paper/model.

## Known Limitations

- The released artifact ports the upstream `humanMedian` Keras weights; other upstream variants (`K562`, `GM12878`, `mESC`, `mouseMedian`) share the same architecture and can be converted with the same converter.
- Xpresso requires a fixed-length promoter window; shorter inputs are right-padded and longer inputs are center-cropped to `input_length`.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```
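
As an appendix to the fixed-length window behavior noted under Known Limitations, the sketch below shows one way right-padding and center-cropping could work on a raw sequence string. It is not the library's implementation: the helper name `fit_to_window`, the pad character `N`, and the choice to drop the extra base from the right when the excess is odd are all assumptions made for illustration.

```python
def fit_to_window(sequence, input_length, pad_char="N"):
    """Right-pad or center-crop `sequence` to exactly `input_length` characters.

    Hypothetical helper: the pad character and the rounding used for
    odd-length excess are assumptions, not documented library behavior.
    """
    excess = len(sequence) - input_length
    if excess <= 0:
        # Too short (or exact): append padding on the right.
        return sequence + pad_char * (-excess)
    # Too long: keep the central window, trimming both ends.
    start = excess // 2
    return sequence[start:start + input_length]


print(fit_to_window("ACGT", 6))    # short input gets padded on the right
print(fit_to_window("AACGTT", 4))  # long input keeps its central 4 bases
```

Because padded positions carry no sequence information, predictions for inputs much shorter than `input_length` should be interpreted with care.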