Malinois

Convolutional neural network for predicting cell-type-targeting cis-regulatory element (CRE) activity from DNA sequence.

Disclaimer

This is an UNOFFICIAL implementation of "Machine-guided design of cell-type-targeting cis-regulatory elements" by Sager J. Gosai, Rodrigo I. Castro, et al.

The OFFICIAL repository of Malinois is at sjgosai/boda2.

The MultiMolecule team has confirmed that the provided model and checkpoints produce the same intermediate representations as the original implementation.

The team that released Malinois did not write a model card for it, so this model card was written by the MultiMolecule team.

Model Details

Malinois is a deep convolutional neural network (a tuned Basset-style "branched" architecture) trained to quantitatively predict cell-type-informed CRE activity from ~200 bp DNA sequences measured by a massively parallel reporter assay (MPRA). The model emits three regression outputs, one per human cell line: K562, HepG2 and SK-N-SH (in that order).

The architecture consists of three convolutional blocks (Conv1D + BatchNorm + ReLU + MaxPool), one shared fully-connected block (Linear + BatchNorm + ReLU + Dropout), and a branched tower of three grouped-linear layers that maintains an independent parameter set per cell line, followed by a per-cell-line output projection. Please refer to the Training Details section for more information on the training process.
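The idea behind the branched tower, one independent parameter set per cell line, can be illustrated with a toy grouped-linear layer. This is a minimal pure-Python sketch, not the MultiMolecule or boda2 implementation: the function name `grouped_linear`, the toy sizes, and the random weights are all assumptions made for illustration.

```python
import random

CELL_LINES = ["K562", "HepG2", "SK-N-SH"]


def grouped_linear(inputs, weights, biases):
    """Apply one independent linear map per group (hypothetical helper).

    inputs:  list of G vectors, one per cell line, each of length d_in
    weights: list of G matrices, each with d_out rows and d_in columns
    biases:  list of G vectors, each of length d_out
    """
    outputs = []
    for x, matrix, bias in zip(inputs, weights, biases):
        # Each group only ever sees its own matrix and bias.
        outputs.append(
            [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(matrix, bias)]
        )
    return outputs


rng = random.Random(0)
d_in, d_out = 4, 2  # toy sizes, NOT the real layer widths

# Features from the shared fully-connected block, replicated to every branch.
shared = [rng.uniform(-1, 1) for _ in range(d_in)]
weights = [
    [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    for _ in CELL_LINES
]
biases = [[0.0] * d_out for _ in CELL_LINES]

branch_out = grouped_linear([shared] * len(CELL_LINES), weights, biases)
```

Because no parameters are shared between groups, changing the weights of one branch leaves the outputs of the other two untouched.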

Upstream Malinois consumes a sequence of fixed length (input_length, 600 bp): each ~200 bp candidate CRE is centered and padded with fixed MPRA plasmid flanks (MPRA_UPSTREAM / MPRA_DOWNSTREAM) before the convolution stack. The flank padding is part of the data pipeline, not the model, so callers must reproduce it to match upstream predictions.
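A rough sketch of the centering-and-padding step is shown below. It assumes the left flank is taken from the 3' end of the upstream plasmid sequence and the right flank from the 5' end of the downstream one, and that any odd remainder goes to the right; both choices should be checked against the upstream repository. The flank strings here are placeholders, not the real MPRA_UPSTREAM / MPRA_DOWNSTREAM sequences, and `pad_with_flanks` is a hypothetical helper name.

```python
INPUT_LENGTH = 600

# Placeholder flanks for illustration only: NOT the real MPRA plasmid sequences.
PLACEHOLDER_UPSTREAM = "A" * 300
PLACEHOLDER_DOWNSTREAM = "T" * 300


def pad_with_flanks(
    candidate,
    upstream=PLACEHOLDER_UPSTREAM,
    downstream=PLACEHOLDER_DOWNSTREAM,
    input_length=INPUT_LENGTH,
):
    """Center `candidate` and pad it to `input_length` with flank sequence."""
    missing = input_length - len(candidate)
    if missing < 0:
        raise ValueError("candidate is longer than the model input length")
    left = missing // 2
    right = missing - left  # assumption: an odd remainder is added on the right
    if left > len(upstream) or right > len(downstream):
        raise ValueError("flanks are too short to pad this candidate")
    # Bases adjacent to the insert: end of upstream, start of downstream.
    left_flank = upstream[len(upstream) - left:]
    right_flank = downstream[:right]
    return left_flank + candidate + right_flank


padded = pad_with_flanks("ACGT" * 50)  # a 200 bp toy candidate
```

With the real flank constants substituted in, the padded string can be passed to the tokenizer in place of a pre-padded sequence.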

Model Specification

Num Layers | Hidden Size | Num Parameters (M) | FLOPs (M) | MACs (M) | Max Num Tokens
8          | 420         | 4.11               | 332.95    | 165.70   | 600

Usage

The model file depends on the multimolecule library. You can install it using pip:

pip install multimolecule

Direct Use

CRE Activity Prediction

You can use this model directly to predict the cell-type-informed CRE activity (K562, HepG2, SK-N-SH) of a sequence. Upstream Malinois pads each ~200 bp candidate to 600 bp with fixed MPRA plasmid flanks before inference; the example below uses a pre-padded 600 bp sequence:

>>> import torch
>>> from multimolecule import DnaTokenizer, MalinoisForSequencePrediction

>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/malinois")
>>> model = MalinoisForSequencePrediction.from_pretrained("multimolecule/malinois")
>>> sequence = "ACGT" * 150
>>> output = model(**tokenizer(sequence, return_tensors="pt"))

>>> output.logits.shape
torch.Size([1, 3])
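Since the three outputs are ordered K562, HepG2, SK-N-SH, it can help to attach names to them before further analysis. The helper below is a small illustrative sketch; `label_predictions` is a hypothetical name and not part of the multimolecule API.

```python
CELL_LINES = ("K562", "HepG2", "SK-N-SH")  # output order stated in this model card


def label_predictions(values):
    """Map one row of three predicted activities to cell-line names."""
    if len(values) != len(CELL_LINES):
        raise ValueError(f"expected {len(CELL_LINES)} values, got {len(values)}")
    return dict(zip(CELL_LINES, (float(v) for v in values)))
```

For the example above, this could be called as `label_predictions(output.logits[0].tolist())`.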

Training Details

Malinois was trained to predict quantitative, cell-type-informed CRE activity from DNA sequence.

Training Data

Malinois was trained on an MPRA dataset measuring the regulatory activity of ~200 bp sequences across three human cell lines (K562, HepG2 and SK-N-SH). Each training example is a sequence with three continuous activity values (log2 fold-change over input), one per cell line. Genomic sequences were split by chromosome into training, validation, and test sets to avoid sequence leakage.
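A chromosome-level split can be sketched as follows. The held-out chromosome sets and the record layout (a dict with a `chrom` key) are assumptions chosen for illustration; consult the paper and the upstream repository for the actual partition.

```python
# Hypothetical held-out sets, used here only to illustrate the mechanism.
VALIDATION_CHROMS = {"chr19", "chr21", "chrX"}
TEST_CHROMS = {"chr7", "chr13"}


def split_by_chromosome(records, val_chroms=VALIDATION_CHROMS, test_chroms=TEST_CHROMS):
    """Assign each record to train/val/test based only on its chromosome."""
    splits = {"train": [], "val": [], "test": []}
    for record in records:
        chrom = record["chrom"]
        if chrom in test_chroms:
            splits["test"].append(record)
        elif chrom in val_chroms:
            splits["val"].append(record)
        else:
            splits["train"].append(record)
    return splits


toy_records = [
    {"chrom": "chr1", "sequence": "ACGT"},
    {"chrom": "chr7", "sequence": "TTGA"},
    {"chrom": "chr19", "sequence": "GGCA"},
    {"chrom": "chr1", "sequence": "CATG"},
]
toy_splits = split_by_chromosome(toy_records)
```

Because every sequence from a given chromosome lands in the same partition, overlapping or neighboring genomic windows cannot appear on both sides of the train/test boundary.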

Training Procedure

Pre-training

The model was trained to minimize an L1 + KL-divergence mixed loss between predicted and measured cell-type CRE activities, with the architecture and training hyperparameters selected by Bayesian optimization.

  • Optimizer: Adam
  • Loss: L1 + KL-divergence mixed loss
  • Input length: 600 bp (200 bp candidate + fixed MPRA plasmid flanks)
  • Outputs: 3 (K562, HepG2, SK-N-SH)
  • Early stopping on validation loss
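One way to read the mixed loss is as an L1 term on the raw activities plus a KL term on the relative activity profile across the three cell lines. The sketch below follows that reading for a single example: it normalizes predictions and targets with a softmax over cell lines and computes KL(target || prediction). The softmax normalization, the KL direction, and the weights `alpha` and `beta` are assumptions for illustration, not values confirmed by this model card.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a short list of floats."""
    peak = max(values)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_sum for v in values]


def l1_kl_mixed_loss(pred, target, alpha=1.0, beta=1.0):
    """L1 on raw activities plus KL on softmax-normalized profiles.

    alpha and beta are hypothetical mixing weights.
    """
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
    log_p = log_softmax(pred)
    log_t = log_softmax(target)
    # KL(target || pred) over the three cell lines.
    kl = sum(math.exp(lt) * (lt - lp) for lt, lp in zip(log_t, log_p))
    return alpha * l1 + beta * kl


target = [1.2, -0.3, 0.4]      # toy measured activities (K562, HepG2, SK-N-SH)
prediction = [0.9, 0.1, 0.4]   # toy predicted activities
loss = l1_kl_mixed_loss(prediction, target)
```

Under this reading, the L1 term penalizes errors in absolute activity, while the KL term is insensitive to a constant shift across all three outputs and penalizes only errors in the cell-type profile.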

Citation

@article{gosai2024malinois,
  author    = {Gosai, Sager J. and Castro, Rodrigo I. and Fuentes, Natalia and Butts, John C. and Mouri, Kousuke and Alasoadura, Michael and Kales, Susan and Nguyen, Thanh Thanh L. and Noche, Ramil R. and Rao, Arya S. and Joy, Mary T. and Sabeti, Pardis C. and Reilly, Steven K. and Tewhey, Ryan},
  journal   = {Nature},
  month     = oct,
  number    = 8036,
  pages     = {1211--1220},
  publisher = {Springer Science and Business Media LLC},
  title     = {Machine-guided design of cell-type-targeting cis-regulatory elements},
  volume    = 634,
  year      = 2024
}

The artifacts distributed in this repository are part of the MultiMolecule project. If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:

@software{chen_2024_12638419,
  author    = {Chen, Zhiyuan and Zhu, Sophia Y.},
  title     = {MultiMolecule},
  doi       = {10.5281/zenodo.12638419},
  publisher = {Zenodo},
  url       = {https://doi.org/10.5281/zenodo.12638419},
  year      = 2024,
  month     = may,
  day       = 4
}

Contact

Please use GitHub issues of MultiMolecule for any questions or comments on the model card.

Please contact the authors of the Malinois paper for questions or comments on the paper/model.

License

This model implementation is licensed under the GNU Affero General Public License.

For additional terms and clarifications, please refer to our License FAQ.

SPDX-License-Identifier: AGPL-3.0-or-later