---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
library_name: multimolecule
---

# Xpresso

Deep convolutional neural network for predicting mRNA abundance directly from genomic promoter sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663) by Vikram Agarwal et al.

The OFFICIAL repository of Xpresso is at [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

**The team that released Xpresso did not write a model card for this model; this model card was written by the MultiMolecule team.**

## Model Details

Xpresso is a deep convolutional neural network (CNN) that predicts steady-state mRNA expression levels directly from genomic sequence. It takes a promoter window of roughly 10.5 kb centered on the transcription start site (TSS) and processes it through a stack of 1D convolution and max-pooling blocks. The flattened output is concatenated with a small set of auxiliary numeric mRNA half-life features, and the combined representation passes through fully connected layers to predict a single scalar expression value. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

| Input Length | Conv Blocks | Hidden Size | Auxiliary Features | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| ------------ | ----------- | ----------- | ------------------ | ------------------ | --------- | -------- | -------------- |
| 10,500       | 2           | 2           | 6                  | 0.11               | 0.11      | 0.05     | 10,500         |
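To make the data flow concrete, the following sketch traces sequence lengths and parameter counts through an Xpresso-like network. Only the input length, the two conv blocks, the final hidden size of 2, and the 6 auxiliary features come from the table above. The filter counts, kernel sizes, pooling widths, first dense width, 4-channel nucleotide input, and length-preserving convolutions are assumptions chosen for illustration and are not stated in this model card.

```python
# Shape and parameter arithmetic for an Xpresso-like CNN.
# ASSUMPTION: every value marked "assumed" is illustrative and is not
# taken from this model card.

INPUT_LENGTH = 10_500   # from the specification table
NUM_FEATURES = 6        # auxiliary half-life features, from the table
IN_CHANNELS = 4         # assumed: one channel per nucleotide (A, C, G, T)

# Assumed (out_channels, kernel_size, pool_size) for each conv block.
CONV_BLOCKS = [(128, 6, 30), (32, 9, 10)]
# Assumed first dense width; the final hidden size of 2 is from the table.
DENSE_SIZES = [64, 2]


def trace(input_length=INPUT_LENGTH):
    """Return (flattened_size, total_parameters) under the assumptions above."""
    length, channels, params = input_length, IN_CHANNELS, 0
    for out_channels, kernel, pool in CONV_BLOCKS:
        # Conv1d weights plus biases; assumed "same" padding keeps the length.
        params += channels * kernel * out_channels + out_channels
        length //= pool              # non-overlapping max pooling
        channels = out_channels
    flattened = length * channels    # flatten the conv output
    width = flattened + NUM_FEATURES # concatenate the auxiliary features
    for size in DENSE_SIZES + [1]:   # the last layer emits one scalar
        params += width * size + size
        width = size
    return flattened, params


flattened, total = trace()
print(flattened, total)
```

Under these assumed hyperparameters the flattened representation has 1,120 values and the network has about 0.11 M parameters, which is in line with the table. Different kernel or pooling choices would change both numbers.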

### Links

- **Code**: [multimolecule.xpresso](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/xpresso)
- **Weights**: [multimolecule/xpresso](https://huggingface.co/multimolecule/xpresso)
- **Paper**: [Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks](https://doi.org/10.1016/j.celrep.2020.107663)
- **Developed by**: Vikram Agarwal, Jay Shendure
- **Original Repository**: [vagarwal87/Xpresso](https://github.com/vagarwal87/Xpresso)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

#### mRNA Expression Prediction

You can use this model directly to predict the mRNA expression of a promoter sequence together with its auxiliary mRNA half-life features:

```python
>>> import torch
>>> from multimolecule import DnaTokenizer, XpressoForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/xpresso")
>>> model = XpressoForSequencePrediction.from_pretrained("multimolecule/xpresso")
>>> input = tokenizer("ACGTACGTACGTACGT", return_tensors="pt")
>>> # Random placeholder values; pass real half-life features in practice.
>>> features = torch.randn(1, model.config.num_features)
>>> output = model(**input, features=features)

>>> output.logits.shape
torch.Size([1, 1])
```

The auxiliary half-life features are passed through the `features` argument as a float tensor of shape `(batch_size, num_features)`. Models configured with a non-zero `num_features` require this tensor; models configured with `num_features=0` do not accept it.
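The shape rule above can be checked before calling the model. The sketch below is a hypothetical helper, not part of the `multimolecule` API: it works on plain nested lists and only mirrors the rule as stated in this section. The name `check_features` and its error messages are assumptions.

```python
def check_features(features, batch_size, num_features):
    """Validate auxiliary features given as a list of rows.

    Hypothetical helper that mirrors the rule in this section; the
    library performs its own checks.
    """
    if num_features == 0:
        # Models configured without auxiliary features must not receive any.
        if features is not None:
            raise ValueError("this configuration does not accept features")
        return None
    if features is None:
        raise ValueError("this configuration requires features")
    if len(features) != batch_size or any(len(row) != num_features for row in features):
        raise ValueError(f"expected shape ({batch_size}, {num_features})")
    # Convert to floats so the rows can be turned into a float tensor.
    return [[float(value) for value in row] for row in features]
```

The returned rows could then be wrapped with `torch.tensor(rows, dtype=torch.float32)` and passed as `features`.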

## Training Details

Xpresso was trained to predict steady-state mRNA expression levels (median across tissues/cell lines) from genomic promoter sequence.

### Training Data

Xpresso was trained on human and mouse genes, using promoter sequences (~10.5 kb windows centered on the TSS) together with mRNA half-life features derived from gene-body and UTR properties. Expression targets are log-transformed median mRNA levels across tissues.
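As a rough illustration of how such a target could be computed for one gene, the sketch below takes the median of per-tissue expression levels and log-transforms it. The base-10 logarithm, the pseudocount of 0.1, and the function name `expression_target` are assumptions made for this example; this card does not specify them.

```python
import math
from statistics import median

# ASSUMPTION: pseudocount and log base are illustrative, not from this card.
PSEUDOCOUNT = 0.1  # avoids log(0) for genes with no measured expression


def expression_target(levels, pseudocount=PSEUDOCOUNT):
    """Log-transformed median expression for one gene (hypothetical helper).

    `levels` holds one non-negative expression value per tissue.
    """
    return math.log10(median(levels) + pseudocount)


# Example with made-up expression values for three tissues.
print(expression_target([0.9, 9.9, 99.9]))
```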

### Training Procedure

#### Pre-training

The model was trained to minimize a mean-squared-error loss between predicted and observed log mRNA expression values.

- Optimizer: Adam
- Loss: Mean squared error
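For reference, the mean-squared-error objective named above can be written in a few lines of plain Python. This is a minimal sketch of the loss itself, not the training loop; the function name `mse_loss` is chosen for this example.

```python
def mse_loss(predicted, observed):
    """Mean squared error between predicted and observed log expression."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sum(squared_errors) / len(squared_errors)


# Two genes with made-up values: errors of 0 and 2 give a loss of (0 + 4) / 2.
print(mse_loss([1.0, 2.0], [1.0, 4.0]))
```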

## Citation

```bibtex
@article{agarwal2020predicting,
  author    = {Agarwal, Vikram and Shendure, Jay},
  journal   = {Cell Reports},
  number    = 7,
  pages     = {107663},
  publisher = {Elsevier BV},
  title     = {Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks},
  volume    = 31,
  year      = 2020
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use the [MultiMolecule GitHub issues](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [Xpresso paper](https://doi.org/10.1016/j.celrep.2020.107663) for questions or comments on the paper/model.

## Known Limitations

- The released artifact ports the upstream `humanMedian` Keras weights; other upstream variants (`K562`, `GM12878`, `mESC`, `mouseMedian`) share the same architecture and can be converted with the same converter.
- Xpresso requires a fixed-length promoter window; shorter inputs are right-padded and longer inputs are center-cropped to `input_length`.
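The fixed-length behaviour described in the last item can be sketched on a raw sequence string as follows. This is an illustration rather than the library's implementation: the padding character `N`, the function name `fit_to_window`, and the choice to round the crop offset down when the excess length is odd are assumptions.

```python
def fit_to_window(sequence, input_length=10_500, pad_char="N"):
    """Right-pad short sequences and center-crop long ones (illustrative).

    ASSUMPTION: padding uses the character "N"; the real model may pad at
    the token level with a dedicated padding token.
    """
    excess = len(sequence) - input_length
    if excess > 0:
        # Drop roughly equal amounts from both ends, keeping the middle.
        start = excess // 2
        return sequence[start:start + input_length]
    # Append padding on the right until the window is full.
    return sequence + pad_char * (-excess)


print(fit_to_window("ACGT", input_length=6))    # short input is right-padded
print(fit_to_window("AACGTT", input_length=4))  # long input is center-cropped
```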

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```