ENCODE ChromBPNet Atlas

As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq experiments across 408 biosamples. Here, we provide all models for open-source use.

For more information about the models, see:

Main ENCODE 4 Paper
A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays (Deshpande et al., Zenodo 2025)
ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants (Pampari et al., bioRxiv 2024)

ChromBPNet model: DNASE in GM23248 (ENCSR217TAW)

Model: ChromBPNet
Assay: DNASE-seq
Experiment: ENCSR217TAW
Model annotation: ENCSR174HCP
Biosample: GM23248 (Full name: Homo sapiens GM23248)
Cell slim(s): connective-tissue-cell,fibroblast
Organ slim(s): limb,skin-of-body,connective-tissue
Developmental slim(s): ectoderm
System slim(s): integumental-system
Assembly: hg38

Directory structure

fold_0: Model of 5-fold cross-validation: Fold 0
- model.chrombpnet.fold_0.encid.h5: full chrombpnet model that combines both bias and corrected model in .h5 format
- model.chrombpnet_nobias.fold_0.encid.h5: bias-corrected accessibility model in .h5 format (Use for all biological discovery)
- model.bias_scaled.fold_0.encid.h5: bias model in .h5 format
- model.chrombpnet.fold_0.encid.tar: full chrombpnet model that combines both bias and corrected model in SavedModel format. After being untarred, it results in a directory named "chrombpnet".
- model.chrombpnet_nobias.fold_0.encid.tar: bias-corrected accessibility model in SavedModel format (Use for all biological discovery). After being untarred, it results in a directory named "chrombpnet_wo_bias".
- model.bias_scaled.fold_0.encid.tar: bias model in SavedModel format. After being untarred, it results in a directory named "bias_model_scaled".
- logs.models.fold_0.encid: folder containing log files for training models
fold_1: Model of 5-fold cross-validation: Fold 1
fold_2: Model of 5-fold cross-validation: Fold 2
fold_3: Model of 5-fold cross-validation: Fold 3
fold_4: Model of 5-fold cross-validation: Fold 4
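
The layout above can be walked programmatically. The sketch below is an illustration only: it assumes that every fold directory follows the fold_0 layout, that "encid" in the file names is a placeholder for an identifier (matched here with a wildcard), and the helper name find_nobias_h5 is hypothetical rather than part of ChromBPNet.

```python
from pathlib import Path

def find_nobias_h5(root, n_folds=5):
    """Map fold index -> path of the bias-corrected .h5 model, if one is present."""
    found = {}
    for k in range(n_folds):
        # "*" stands in for the "encid" part of the file name (assumed placeholder).
        pattern = f"model.chrombpnet_nobias.fold_{k}.*.h5"
        matches = sorted(Path(root, f"fold_{k}").glob(pattern))
        if matches:
            found[k] = matches[0]
    return found
```

Calling find_nobias_h5(".") from the repository root would return one path per fold that has been downloaded; folds that are missing are simply absent from the result.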

Instructions

1. Pseudocode for loading models in .h5 format

(1) Use the code below in Python after defining model_in_h5_format (the path to the .h5 file) and inputs.
(2) inputs is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number of sequences, 2114 is the input sequence length, and 4 is the base axis in the order [A, C, G, T].

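import tensorflow as tf
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.models import load_model

# Make the tf module available under the name "tf" while the model is deserialized.
custom_objects = {"tf": tf}
get_custom_objects().update(custom_objects)

# compile=False: load for inference only, without restoring the training setup.
model = load_model(model_in_h5_format, compile=False)
outputs = model(inputs)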

outputs is a list of two elements. The first has shape (N, 1000) and contains profile logits over the 1000-bp output window. The second has shape (N, 1) and contains logcount predictions. To convert these predictions into per-base signal, use the code below.

import numpy as np

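def softmax(x, temp=1):
    # Subtract the per-row mean before exponentiating; softmax is shift-invariant.
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Per-base signal: profile probabilities scaled by exp(logcount) - 1.
predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1)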
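
As a quick check of this conversion, the sketch below applies the same formula to synthetic arrays (stand-ins, not real model outputs) and confirms that each row of predictions sums to exp(logcount) - 1. In other words, the softmax only distributes the predicted total across the 1000 positions.

```python
import numpy as np

def softmax(x, temp=1):
    # Same formula as above, written for plain NumPy arrays.
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    e = np.exp(temp * norm_x)
    return e / np.sum(e, axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 1000))           # synthetic stand-in for outputs[0]
logcounts = np.array([[2.0], [5.0], [0.5]])   # synthetic stand-in for outputs[1]

predictions = softmax(logits) * (np.exp(logcounts) - 1)
totals = predictions.sum(axis=1)

print(predictions.shape)
print(np.allclose(totals, np.exp(logcounts[:, 0]) - 1))
```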

2. Pseudocode for loading models in .tar format

(1) First untar the archive, for example: tar -xvf model.tar.
(2) Use the code below in Python after defining model_dir_untared (the path to the extracted directory) and inputs.
(3) inputs is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number of sequences, 2114 is the input sequence length, and 4 is the base axis in the order [A, C, G, T].
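
Step (1) can also be done from Python with the standard-library tarfile module. The sketch below is one possible way to do it; the function name untar_model and its arguments are hypothetical, and expected_dir should be one of the directory names listed under Directory structure (for example "chrombpnet_wo_bias").

```python
import tarfile
from pathlib import Path

def untar_model(tar_path, dest_dir, expected_dir):
    """Extract a model archive and return the path of the directory it contains.

    Only extract archives obtained from a source you trust.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest)
    model_dir = dest / expected_dir
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} not found after extraction")
    return str(model_dir)
```

The returned string can then be used as model_dir_untared in step (2).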

Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load

import tensorflow as tf

# Pass the variable holding the extracted directory path, not a literal string.
model = tf.saved_model.load(model_dir_untared)
outputs = model.signatures['serving_default'](**{'sequence':inputs.astype('float32')})

outputs is a dictionary with two entries. The key logits_profile_predictions holds a value of shape (N, 1000) containing profile logits over the 1000-bp output window. The key logcount_predictions holds a value of shape (N, 1) containing logcount predictions. To convert these predictions into per-base signal, use the code below.

import numpy as np
def softmax(x, temp=1):
    norm_x = x - np.mean(x,axis=1, keepdims=True)
    return np.exp(temp*norm_x)/np.sum(np.exp(temp*norm_x), axis=1, keepdims=True)
    
predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"])-1)
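
Both recipes take inputs as a one-hot array of shape (N, 2114, 4) with the base axis ordered A, C, G, T. A minimal sketch of building such an array from DNA strings follows. The function name one_hot_encode is hypothetical, and encoding N or any other non-ACGT character as an all-zero row is an assumption of this sketch, not something specified above.

```python
import numpy as np

INPUT_LEN = 2114
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seqs, length=INPUT_LEN):
    """Encode equal-length DNA strings as a float32 array of shape (N, length, 4)."""
    out = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {length}")
        for j, base in enumerate(seq.upper()):
            k = BASE_TO_INDEX.get(base)
            if k is not None:  # assumption: unknown bases stay all-zero
                out[i, j, k] = 1.0
    return out

# Toy sequence of exactly 2114 bases, for illustration only.
inputs = one_hot_encode(["ACGT" * 528 + "AC"])
print(inputs.shape)  # (1, 2114, 4)
```

The array is created as float32, which matches the cast applied before calling the SavedModel signature.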

Docker image to load and use the models

https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)

Code for ChromBPNet

https://github.com/kundajelab/chrombpnet/

License & citation

External data users may freely download, analyze and publish results based on any ENCODE data without restrictions.

Released under the ENCODE data-use policy. Please cite the ENCODE Project Consortium and the model software: ChromBPNet (Pampari et al., bioRxiv 2024).
