ENCODE ChromBPNet Atlas
As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq datasets across 408 biosamples. Here, we provide all models for open-source use.
For more information about the models, see:
- Main ENCODE 4 Paper
- A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays
- ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants
ChromBPNet model: DNASE in GM23248 (ENCSR217TAW)
- Model: ChromBPNet
- Assay: DNASE-seq
- Experiment: ENCSR217TAW
- Model annotation: ENCSR174HCP
- Biosample: GM23248 (Homo sapiens GM23248)
- Cell slim(s): connective-tissue-cell,fibroblast
- Organ slim(s): limb,skin-of-body,connective-tissue
- Developmental slim(s): ectoderm
- System slim(s): integumental-system
- Assembly: hg38
Directory structure
fold_0: Model for cross-validation fold 0
- model.chrombpnet.fold_0.encid.h5: full ChromBPNet model that combines both the bias and corrected models, in .h5 format
- model.chrombpnet_nobias.fold_0.encid.h5: bias-corrected accessibility model in .h5 format (use for all biological discovery)
- model.bias_scaled.fold_0.encid.h5: bias model in .h5 format
- model.chrombpnet.fold_0.encid.tar: full ChromBPNet model that combines both the bias and corrected models, in SavedModel format. Untarring it yields a directory named "chrombpnet".
- model.chrombpnet_nobias.fold_0.encid.tar: bias-corrected accessibility model in SavedModel format (use for all biological discovery). Untarring it yields a directory named "chrombpnet_wo_bias".
- model.bias_scaled.fold_0.encid.tar: bias model in SavedModel format. Untarring it yields a directory named "bias_model_scaled".
- logs.models.fold_0.encid: folder containing log files for training models
fold_1: Model for cross-validation fold 1
fold_2: Model for cross-validation fold 2
fold_3: Model for cross-validation fold 3
fold_4: Model for cross-validation fold 4
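To make the naming pattern above concrete, here is a minimal sketch that builds the expected path of a model file for each fold. The helper `model_path` and the root directory are hypothetical. The sketch also assumes that each file sits inside its `fold_k` directory and that `encid` in the file names is a placeholder for an ENCODE accession, so it is kept literal by default.

```python
from pathlib import Path

# Model kinds and formats taken from the directory listing above.
KINDS = {"chrombpnet", "chrombpnet_nobias", "bias_scaled"}
EXTS = {"h5", "tar"}
FOLDS = range(5)  # fold_0 ... fold_4


def model_path(root, fold, kind="chrombpnet_nobias", ext="h5", encid="encid"):
    """Build the expected path of one model file (hypothetical helper).

    Assumption: "encid" is a placeholder for an ENCODE accession, so the
    default keeps it literal; pass the real value if your files differ.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown model kind: {kind}")
    if ext not in EXTS:
        raise ValueError(f"unknown model format: {ext}")
    return Path(root) / f"fold_{fold}" / f"model.{kind}.fold_{fold}.{encid}.{ext}"


# Bias-corrected .h5 models for all five folds, relative to the current directory.
paths = [model_path(".", fold) for fold in FOLDS]
for p in paths:
    print(p.as_posix())
```

The default kind is the bias-corrected model because the listing recommends it for biological discovery.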
Instructions
(1) Pseudocode for loading models in .h5 format
(1) Use the code below in Python after appropriately defining model_in_h5_format and inputs.
(2) inputs is a one-hot encoded sequence array of shape (N,2114,4). Here N is the
number of tested sequences, 2114 is the input sequence length, and 4 corresponds to [A,C,G,T].
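import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import get_custom_objects

# Register tf as a custom object so saved layers that refer to it can be resolved
get_custom_objects().update({"tf": tf})

# compile=False skips restoring the training configuration; the model is used for inference only
model = load_model(model_in_h5_format, compile=False)
outputs = model(inputs)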
The list outputs has two elements. The first, of shape (N, 1000), contains logit
predictions for a 1000-base-pair output. The second, of shape (N, 1), contains
logcount predictions. To transform these predictions into per-base signals,
use the code below.
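import numpy as np

def softmax(x, temp=1):
    # Center the logits per row before exponentiating (the softmax is unchanged by this shift)
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp*norm_x) / np.sum(np.exp(temp*norm_x), axis=1, keepdims=True)

# Per-base signal: profile probabilities scaled by the predicted total counts
predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1)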
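As a quick sanity check of the conversion above, the sketch below applies the same formula to synthetic arrays instead of real model outputs. The arrays `fake_logits` and `fake_logcounts` and the wrapper `to_per_base_signal` are illustrative assumptions and are not part of the released code. Because the softmax sums to 1 along the profile axis, each row of the result should sum to exp(logcount) - 1.

```python
import numpy as np


def softmax(x, temp=1):
    # Same formula as above: center each row, then normalize the exponentials.
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)


def to_per_base_signal(logits, logcounts):
    """Combine profile logits (N, 1000) and logcounts (N, 1) into per-base signal.

    Hypothetical wrapper: np.asarray converts array-like inputs to NumPy first.
    """
    logits = np.asarray(logits, dtype=np.float64)
    logcounts = np.asarray(logcounts, dtype=np.float64)
    return softmax(logits) * (np.exp(logcounts) - 1)


# Synthetic stand-ins for model outputs (assumption: random values, N = 3).
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(3, 1000))
fake_logcounts = np.log1p(np.array([[10.0], [250.0], [0.0]]))

signal = to_per_base_signal(fake_logits, fake_logcounts)
print(signal.shape)        # one value per output base for each sequence
print(signal.sum(axis=1))  # row totals, approximately exp(logcount) - 1
```

With the totals chosen here, the row sums come out close to 10, 250 and 0, which confirms that the profile only distributes the predicted total along the 1000 output positions.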
(2) Pseudocode for loading models in .tar format
(1) First untar the archive: tar -xvf model.tar
(2) Use the code below in Python after appropriately defining model_dir_untared and inputs.
(3) inputs is a one-hot encoded sequence array of shape (N,2114,4). Here N is the number
of tested sequences, 2114 is the input sequence length, and 4 corresponds to [A,C,G,T].
Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load
import tensorflow as tf

# model_dir_untared is a variable holding the path to the untarred SavedModel directory
model = tf.saved_model.load(model_dir_untared)
outputs = model.signatures['serving_default'](**{'sequence': inputs.astype('float32')})
The variable outputs is a dictionary with two key-value pairs. The first key,
logits_profile_predictions, holds a value of shape (N, 1000) containing logit
predictions for a 1000-base-pair output. The second key, logcount_predictions,
holds a value of shape (N, 1) containing logcount predictions. To transform these
predictions into per-base signals, use the code below.
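import numpy as np

def softmax(x, temp=1):
    # Center the logits per row before exponentiating (the softmax is unchanged by this shift)
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp*norm_x) / np.sum(np.exp(temp*norm_x), axis=1, keepdims=True)

# Per-base signal: profile probabilities scaled by the predicted total counts
predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"]) - 1)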
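Both loading paths above expect inputs as a one-hot array of shape (N,2114,4) with channels ordered [A,C,G,T]. The following is a minimal sketch of how such an array could be built from raw sequence strings. The helper `one_hot_encode` is hypothetical, and encoding any base other than A, C, G or T (for example N) as an all-zero row is an assumption of this sketch, not something the instructions specify.

```python
import numpy as np

INPUT_LEN = 2114                                 # input sequence length stated above
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}    # channel order [A,C,G,T]


def one_hot_encode(seqs):
    """One-hot encode DNA strings into a float32 array of shape (N, 2114, 4).

    Assumption: bases outside A/C/G/T (e.g. N) are left as all-zero rows.
    """
    encoded = np.zeros((len(seqs), INPUT_LEN, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != INPUT_LEN:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {INPUT_LEN}")
        for j, base in enumerate(seq.upper()):
            k = BASE_INDEX.get(base)
            if k is not None:
                encoded[i, j, k] = 1.0
    return encoded


# Toy sequence of exactly 2114 bases: 528 repeats of "ACGT" plus "AN".
inputs = one_hot_encode(["ACGT" * 528 + "AN"])
print(inputs.shape, inputs.dtype)
```

The array is created as float32 so it can be passed to either loading path without further conversion.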
Docker image to load and use the models
https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)
Code for ChromBPNet
License & citation
External data users may freely download, analyze and publish results based on any ENCODE data without restrictions.
Released under the ENCODE data-use policy. Please cite the ENCODE Project Consortium and the model software: ChromBPNet (Pampari et al., bioRxiv 2024).