---
license: mit
library_name: chrombpnet
tags:
- encode
- chrombpnet
- chromatin-accessibility
- DNASE
- t-cell
- hg38
---

# ENCODE ChromBPNet Atlas

As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq datasets across 408 biosamples. Here, we provide all models for open-source use. For more information about the models, see:

- Main ENCODE 4 Paper
- [A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays](https://doi.org/10.5281/zenodo.17123347) (Deshpande et al., Zenodo 2025)
- [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants](https://doi.org/10.1101/2024.12.25.630221) (Pampari et al., bioRxiv 2024)

## ChromBPNet model: DNASE in activated CD4-positive, alpha-beta T cell (ENCSR679EFH)

- Model: ChromBPNet
- Assay: DNase-seq
- Experiment: [ENCSR679EFH](https://www.encodeproject.org/experiments/ENCSR679EFH/)
- Model annotation: [ENCSR434QWM](https://www.encodeproject.org/annotations/ENCSR434QWM/)
- Biosample: activated CD4-positive, alpha-beta T cell (Full name: Homo sapiens activated CD4-positive, alpha-beta T cell female adult (33 years) treated with anti-CD3 and anti-CD28 coated beads for 24 hours)
- Cell slim(s): T-cell, leukocyte, hematopoietic-cell, CD4+-T-cell
- Organ slim(s): bodily-fluid, blood
- Developmental slim(s): mesoderm, endoderm
- System slim(s): immune-system
- Assembly: hg38

## Directory structure

- `fold_0`: Model for fold 0 of 5-fold cross-validation
  - `model.chrombpnet.fold_0.encid.h5`: full ChromBPNet model that combines the bias model and the bias-corrected model, in .h5 format
  - `model.chrombpnet_nobias.fold_0.encid.h5`: bias-corrected accessibility model in .h5 format (use for all biological discovery)
  - `model.bias_scaled.fold_0.encid.h5`: bias model in .h5 format
  - `model.chrombpnet.fold_0.encid.tar`: full ChromBPNet model that combines the bias model and the bias-corrected model, in SavedModel format. Untarring it produces a directory named "chrombpnet".
  - `model.chrombpnet_nobias.fold_0.encid.tar`: bias-corrected accessibility model in SavedModel format (use for all biological discovery). Untarring it produces a directory named "chrombpnet_wo_bias".
  - `model.bias_scaled.fold_0.encid.tar`: bias model in SavedModel format. Untarring it produces a directory named "bias_model_scaled".
  - `logs.models.fold_0.encid`: folder containing log files from model training
- `fold_1`: Model for fold 1 of 5-fold cross-validation
- `fold_2`: Model for fold 2 of 5-fold cross-validation
- `fold_3`: Model for fold 3 of 5-fold cross-validation
- `fold_4`: Model for fold 4 of 5-fold cross-validation

# Instructions

## 1. Pseudocode for loading models in .h5 format

(1) Use the code below in Python after appropriately defining `model_in_h5_format` and `inputs`. \
(2) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4). Here N is the number of tested sequences, 2114 is the input sequence length, and 4 corresponds to [A, C, G, T].

```python
import tensorflow as tf
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.models import load_model

# Make `tf` available as a custom object when deserializing the model
custom_objects = {"tf": tf}
get_custom_objects().update(custom_objects)

# `model_in_h5_format` is the path to one of the .h5 files listed above
model = load_model(model_in_h5_format, compile=False)
outputs = model(inputs)
```

The list `outputs` consists of two elements. The first element has a shape of (N, 1000) and contains logit predictions for a 1000-base-pair output. The second element has a shape of (N, 1) and contains logcount predictions. To transform these predictions into per-base signals, use the code below.

```python
import numpy as np

def softmax(x, temp=1):
    # Center each row on its mean before exponentiating
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Profile probabilities (N, 1000) scaled by the predicted total counts (N, 1)
predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1)
```

## 2. Pseudocode for loading models in .tar format

(1) First untar the directory as follows: `tar -xvf model.tar`. \
(2) Use the code below in Python after appropriately defining `model_dir_untared` and `inputs`. \
(3) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4). Here N is the number of tested sequences, 2114 is the input sequence length, and 4 corresponds to [A, C, G, T].

Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load

```python
import tensorflow as tf

# `model_dir_untared` is the path to the directory produced by untarring the model
model = tf.saved_model.load(model_dir_untared)
outputs = model.signatures['serving_default'](**{'sequence': inputs.astype('float32')})
```

The variable `outputs` is a dictionary containing two key-value pairs. The first key, `logits_profile_predictions`, holds a value of shape (N, 1000) containing logit predictions for a 1000-base-pair output. The second key, `logcount_predictions`, holds a value of shape (N, 1) containing logcount predictions. To transform these predictions into per-base signals, use the code below.

```python
import numpy as np

def softmax(x, temp=1):
    # Center each row on its mean before exponentiating
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Profile probabilities (N, 1000) scaled by the predicted total counts (N, 1)
predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"]) - 1)
```

## Docker image to load and use the models

- https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)

## Code for ChromBPNet

- https://github.com/kundajelab/chrombpnet/

# License & citation

External data users may freely download, analyze and publish results based on any ENCODE data without restrictions. Released under the [ENCODE data-use policy](https://www.encodeproject.org/about/data-use-policy/). Please cite the ENCODE Project Consortium and the model software: [ChromBPNet](https://github.com/kundajelab/chrombpnet) (Pampari et al., bioRxiv 2024).
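
# Example: preparing `inputs`

Both loading recipes in the Instructions section take `inputs` as a one-hot encoded array of shape (N, 2114, 4) in [A, C, G, T] order. The snippet below is a minimal sketch of one way to build such an array from DNA strings. The helper name `one_hot_encode` and the choice to leave any non-ACGT character (such as N) as an all-zero row are assumptions made for illustration; they are not part of the ChromBPNet API.

```python
import numpy as np

# Column order follows the [A, C, G, T] convention from the Instructions section.
BASES = "ACGT"
INPUT_LEN = 2114  # input sequence length stated in the Instructions section


def one_hot_encode(seqs, length=INPUT_LEN):
    """Hypothetical helper: encode DNA strings as a float32 array of shape (N, length, 4)."""
    encoded = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {length}")
        for j, base in enumerate(seq.upper()):
            k = BASES.find(base)
            if k >= 0:  # assumption: non-ACGT characters (e.g. N) stay all-zero
                encoded[i, j, k] = 1.0
    return encoded


# Toy sequences only; real inputs would be 2114-bp windows from the hg38 assembly.
inputs = one_hot_encode(["ACGT" * 528 + "AC", "N" * 2114])
print(inputs.shape)  # (2, 2114, 4)
```

The resulting array is already `float32`, so it can be passed as `inputs` to either the .h5 or the SavedModel recipe without further conversion.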