File size: 7,610 Bytes

1d011cb

---
language: dna
tags:
  - Biology
  - DNA
license: agpl-3.0
library_name: multimolecule
---

# ProCapNet

Base-resolution convolutional neural network for predicting PRO-cap transcription-initiation signal from DNA sequence.

## Disclaimer

This is an UNOFFICIAL implementation of [Dissecting the cis-regulatory syntax of transcription initiation with deep learning](https://doi.org/10.1101/2024.05.28.596138) by Kelly Cochran et al.

The OFFICIAL repository of ProCapNet is at [kundajelab/ProCapNet](https://github.com/kundajelab/ProCapNet).

> [!WARNING]
> The reference publication is a **bioRxiv preprint**: DOI [10.1101/2024.05.28.596138](https://doi.org/10.1101/2024.05.28.596138).

> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.

**The team releasing ProCapNet did not write this model card for this model so this model card has been written by the MultiMolecule team.**

## Model Details

ProCapNet is a convolutional neural network (CNN) trained to predict base-resolution PRO-cap transcription-initiation signal from primary DNA sequence. Its architecture is largely adapted from Jacob Schreiber's `bpnet-lite` and shares BPNet's dilated-convolution backbone and profile/count factorization. It uses a convolutional motif stem followed by a stack of dilated residual convolutions that aggregate ~1 kb of genomic context, and is trained mappability-aware.

ProCapNet predicts a single base-resolution PRO-cap task whose output is **two-stranded** (plus / minus strand) and factorized into two terminal branches that share the dilated-convolution backbone:

- a **profile** branch predicting the _shape_ of the initiation signal as per-position, two-stranded multinomial logits, trained with a multinomial negative log-likelihood. Unlike single-stranded BPNet, the multinomial is **joint over both strands and all positions** (the plus / minus strands share one total count);
- a **count** branch predicting the _total magnitude_ of the signal as a single strand-merged scalar (in log space), trained with mean-squared error on `log(count + 1)`.

The usable base-resolution prediction recombines the two branches as `softmax(profile_logits, strands & positions) * exp(count_logits)`, exposed via `ProCapNetForProfilePrediction.postprocess`. Please refer to the [Training Details](#training-details) section for more information on the training process.

### Model Specification

| Input Length | Profile Length | Num Layers | Hidden Size | Num Parameters (M) | FLOPs (G) | MACs (G) |
| ------------ | -------------- | ---------- | ----------- | ------------------ | --------- | -------- |
| 2114         | 1000           | 9          | 512         | 6.43               | 27.17     | 13.58    |

FLOPs and MACs are measured on the canonical 2114 bp ProCapNet input window.

### Links

- **Code**: [multimolecule.procapnet](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/procapnet)
- **Weights**: [multimolecule/procapnet](https://huggingface.co/multimolecule/procapnet)
- **Data**: [K562 PRO-cap (ENCODE ENCSR261KBX)](https://www.encodeproject.org/experiments/ENCSR261KBX/)
- **Paper**: [Dissecting the cis-regulatory syntax of transcription initiation with deep learning](https://doi.org/10.1101/2024.05.28.596138)
- **Developed by**: Kelly Cochran, Melody Yin, Anika Mantripragada, Jacob Schreiber, Georgi K. Marinov, Sagar R. Shah, Haiyuan Yu, John T. Lis, Anshul Kundaje
- **Original Repository**: [kundajelab/ProCapNet](https://github.com/kundajelab/ProCapNet)

## Usage

The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:

```bash
pip install multimolecule
```

### Direct Use

You can use this model directly to predict PRO-cap transcription-initiation profiles of a DNA sequence:

```python
>>> from multimolecule import DnaTokenizer, ProCapNetForProfilePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/procapnet")
>>> model = ProCapNetForProfilePrediction.from_pretrained("multimolecule/procapnet")
>>> output = model(**tokenizer(("ACGT" * 529)[:2114], return_tensors="pt"))

>>> output.keys()
odict_keys(['profile_logits', 'count_logits'])

>>> output["profile_logits"].shape
torch.Size([1, 1000, 2])

>>> output["count_logits"].shape
torch.Size([1, 1])

>>> track = model.postprocess(output)
>>> track.shape
torch.Size([1, 1000, 2])
```

The recombined `track` is the usable base-resolution prediction. The last dimension stacks the `num_strands` (plus, minus) PRO-cap signal predictions.

## Training Details

ProCapNet was trained to predict the base-resolution, two-stranded PRO-cap transcription-initiation signal in human cell lines (the default converted checkpoint is the K562 model).

### Training Data

The published ProCapNet models were trained on PRO-cap signal using ~2 kb genomic windows. The default converted K562 model is distributed via the ENCODE portal as file [ENCFF976FHE](https://www.encodeproject.org/files/ENCFF976FHE/) and was trained on K562 PRO-cap experiment [ENCSR261KBX](https://www.encodeproject.org/experiments/ENCSR261KBX/) (PyTorch `state_dict`, 7 cross-validation folds; fold 0 is converted as the default). Training and test regions, observed signal tracks, and contribution scores are distributed through the same ENCODE release.

### Training Procedure

#### Training

The model was trained with a composite loss: a (strand-merged) multinomial negative log-likelihood on the per-position, two-stranded profile shape plus a mean-squared-error regression on `log(count + 1)` total counts.

- Backbone: 1 motif convolution (512 filters, kernel 21, ReLU) + 8 dilated residual convolutions (512 filters, kernel 3, dilations 2, 4, 8, …, 256, ReLU)
- Profile head: convolution (kernel 75) producing per-position, two-stranded logits
- Count head: global average pooling + linear layer producing a single strand-merged log-count scalar
- Optimizer: Adam
- Training is mappability-aware

## Citation

```bibtex
@article{cochran2024procapnet,
  author    = {Cochran, Kelly and Yin, Melody and Mantripragada, Anika and Schreiber, Jacob and Marinov, Georgi K. and Shah, Sagar R. and Yu, Haiyuan and Lis, John T. and Kundaje, Anshul},
  title     = {Dissecting the cis-regulatory syntax of transcription initiation with deep learning},
  journal   = {bioRxiv},
  year      = 2024,
  doi       = {10.1101/2024.05.28.596138},
  note      = {Preprint}
}
```

> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

```bibtex
@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}
```

## Contact

Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.

Please contact the authors of the [ProCapNet paper](https://doi.org/10.1101/2024.05.28.596138) for questions or comments on the paper/model.

## License

This model implementation is licensed under the [GNU Affero General Public License](license.md).

For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).

```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```