| --- |
| license: mit |
| library_name: chrombpnet |
| tags: |
| - encode |
| - chrombpnet |
| - chromatin-accessibility |
| - DNASE |
| - t-cell |
| - hg38 |
| --- |
| # ENCODE ChromBPNet Atlas |
| As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq experiments across 408 biosamples. Here, we provide all models for open-source use. |
|
|
| For more information about the models, see: |
| - Main ENCODE 4 Paper |
| - [A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays](https://doi.org/10.5281/zenodo.17123347) |
| - [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants](https://doi.org/10.1101/2024.12.25.630221) |
|
|
| ## ChromBPNet model: DNASE in activated CD4-positive, alpha-beta T cell (ENCSR679EFH) |
| - Model: ChromBPNet |
| - Assay: DNase-seq |
| - Experiment: [ENCSR679EFH](https://www.encodeproject.org/experiments/ENCSR679EFH/) |
| - Model annotation: [ENCSR434QWM](https://www.encodeproject.org/annotations/ENCSR434QWM/) |
| - Biosample: activated CD4-positive, alpha-beta T cell (Homo sapiens activated CD4-positive, alpha-beta T cell female adult (33 years) treated with anti-CD3 and anti-CD28 coated beads for 24 hours) |
| - Cell slim(s): T-cell,leukocyte,hematopoietic-cell,CD4+-T-cell |
| - Organ slim(s): bodily-fluid,blood |
| - Developmental slim(s): mesoderm,endoderm |
| - System slim(s): immune-system |
| - Assembly: hg38 |
|
|
| ## Directory structure |
| - `fold_0`: Model: Cross-validation fold: Fold 0 |
|   - `model.chrombpnet.fold_0.encid.h5`: full ChromBPNet model that combines both the bias model and the bias-corrected model, in .h5 format |
|   - `model.chrombpnet_nobias.fold_0.encid.h5`: bias-corrected accessibility model in .h5 format (use this model for all biological discovery) |
|   - `model.bias_scaled.fold_0.encid.h5`: bias model in .h5 format |
|   - `model.chrombpnet.fold_0.encid.tar`: full ChromBPNet model that combines both the bias model and the bias-corrected model, in SavedModel format. Untarring it produces a directory named "chrombpnet". |
|   - `model.chrombpnet_nobias.fold_0.encid.tar`: bias-corrected accessibility model in SavedModel format (use this model for all biological discovery). Untarring it produces a directory named "chrombpnet_wo_bias". |
|   - `model.bias_scaled.fold_0.encid.tar`: bias model in SavedModel format. Untarring it produces a directory named "bias_model_scaled". |
|   - `logs.models.fold_0.encid`: folder containing log files from model training |
| - `fold_1`: Model: Cross-validation fold: Fold 1 |
| - `fold_2`: Model: Cross-validation fold: Fold 2 |
| - `fold_3`: Model: Cross-validation fold: Fold 3 |
| - `fold_4`: Model: Cross-validation fold: Fold 4 |
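
The naming pattern above can be turned into file paths programmatically. The following is a minimal sketch, not part of the ChromBPNet package: `fold_model_paths` is a hypothetical helper, `atlas_download` is a made-up local directory, and the sketch assumes that `encid` in the file names is a placeholder and that `fold_1` through `fold_4` mirror the layout listed for `fold_0`.

```python
from pathlib import Path

def fold_model_paths(root, fold, encid, fmt="h5"):
    """Hypothetical helper: expected paths of the three model files for one fold."""
    fold_dir = Path(root) / f"fold_{fold}"
    kinds = ("chrombpnet", "chrombpnet_nobias", "bias_scaled")
    # Pattern taken from the listing above: model.<kind>.fold_<n>.<encid>.<fmt>
    return {kind: fold_dir / f"model.{kind}.fold_{fold}.{encid}.{fmt}" for kind in kinds}

# "encid" is left as the literal placeholder used in the listing (assumption).
paths = fold_model_paths("atlas_download", fold=0, encid="encid")
print(paths["chrombpnet_nobias"])
```

The `chrombpnet_nobias` entry is the one the listing recommends for biological discovery.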
|
|
| # Instructions |
| ## (1) Pseudocode for loading models in .h5 format |
|
|
| (1) Run the model-loading Python code below after defining `model_in_h5_format` (the path to an `.h5` model file) and `inputs`. |
| (2) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the |
| number of sequences, 2114 is the input sequence length, and 4 corresponds to the bases [A, C, G, T]. |
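
As an illustration of the `inputs` array described above, here is a minimal NumPy sketch of one-hot encoding. The helper `one_hot_encode` is hypothetical and not part of ChromBPNet, and leaving non-ACGT characters such as `N` as all-zero rows is an assumption of this sketch rather than something the model card specifies.

```python
import numpy as np

BASES = "ACGT"  # channel order [A, C, G, T], as stated above

def one_hot_encode(seqs, length=2114):
    """One-hot encode DNA strings of a fixed length into shape (N, length, 4)."""
    encoded = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {length}")
        for j, base in enumerate(seq.upper()):
            k = BASES.find(base)
            if k >= 0:  # assumption: other characters (e.g. N) stay all-zero
                encoded[i, j, k] = 1.0
    return encoded

# A synthetic 2114 bp sequence: 528 repeats of "ACGT" plus "AC".
inputs = one_hot_encode(["ACGT" * 528 + "AC"])
print(inputs.shape)  # (1, 2114, 4)
```

Real inputs would normally be 2114 bp windows extracted from the hg38 reference genome rather than a synthetic repeat.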
|
|
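| ```python |
| import tensorflow as tf |
| from tensorflow.keras.utils import get_custom_objects |
| from tensorflow.keras.models import load_model |
| |
| # Register `tf` as a custom object so the saved model can be deserialized. |
| custom_objects = {"tf": tf} |
| get_custom_objects().update(custom_objects) |
| |
| # compile=False skips restoring the training configuration; the model is used for inference only. |
| model = load_model(model_in_h5_format, compile=False) |
| outputs = model(inputs)  # [profile logits (N, 1000), log counts (N, 1)] |
| ``` |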
|
|
| `outputs` is a list of two elements. The first has shape (N, 1000) and contains the |
| profile logits across the 1,000 bp output window. The second has shape (N, 1) and contains |
| the log-count predictions. To convert these predictions into per-base signal, |
| apply the code below. |
|
|
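| ```python |
| import numpy as np |
| |
| def softmax(x, temp=1): |
|     # Center the logits per sequence before exponentiating. |
|     norm_x = x - np.mean(x, axis=1, keepdims=True) |
|     return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True) |
| |
| # Per-base signal: profile probabilities scaled by the predicted total counts. |
| predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1) |
| ``` |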
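
To make the transformation concrete, the sketch below applies the same formula to synthetic arrays instead of real model outputs. The random logits and the chosen totals of 50 and 200 are made-up stand-ins. Because the softmax sums to 1 along each row, the per-base signal for each sequence should sum to `exp(logcount) - 1`.

```python
import numpy as np

def softmax(x, temp=1):
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 1000))                 # stand-in for outputs[0]
logcounts = np.log1p(np.array([[50.0], [200.0]]))   # stand-in for outputs[1]

predictions = softmax(logits) * (np.exp(logcounts) - 1)
print(predictions.shape)        # (2, 1000)
print(predictions.sum(axis=1))  # approximately 50 and 200
```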
|
|
| ## (2) Pseudocode for loading models in .tar format |
|
|
| (1) First untar the archive: `tar -xvf model.tar` |
| (2) Run the model-loading Python code below after defining `model_dir_untared` (the path to the extracted directory) and `inputs`. |
| (3) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number |
| of sequences, 2114 is the input sequence length, and 4 corresponds to the bases [A, C, G, T]. |
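
The untar step in (1) can also be done from Python with the standard-library `tarfile` module. This is a minimal sketch: `untar_model` is a hypothetical helper, not part of ChromBPNet, and it assumes the archive comes from a trusted source.

```python
import tarfile
from pathlib import Path

def untar_model(tar_path, dest_dir="."):
    """Extract a model archive and return the top-level paths it contained."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        # Collect the first path component of every archive member.
        top_level = sorted(
            {Path(name).parts[0] for name in archive.getnames() if Path(name).parts}
        )
        archive.extractall(dest)  # only extract archives from a trusted source
    return [dest / name for name in top_level]

# Example (commented out because it needs a real archive on disk):
# model_dir_untared = untar_model("model.tar")[0]
```

According to the directory listing, the returned directory should be named "chrombpnet", "chrombpnet_wo_bias", or "bias_model_scaled", depending on which archive was extracted.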
|
|
| Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load |
|
|
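| ```python |
| import tensorflow as tf |
| |
| # `model_dir_untared` is the path to the extracted SavedModel directory. |
| model = tf.saved_model.load(model_dir_untared) |
| # The serving signature takes the one-hot input under the keyword `sequence`. |
| outputs = model.signatures['serving_default'](sequence=tf.constant(inputs.astype('float32'))) |
| ``` |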
|
|
| `outputs` is a dictionary with two entries. The key `logits_profile_predictions` holds a |
| value of shape (N, 1000): the profile logits across the 1,000 bp output window. |
| The key `logcount_predictions` holds a value of shape (N, 1): the log-count predictions. |
| To convert these predictions into per-base signal, apply the code below. |
|
|
| ```python |
| import numpy as np |
| |
| def softmax(x, temp=1): |
|     # Center the logits per sequence before exponentiating. |
|     norm_x = x - np.mean(x, axis=1, keepdims=True) |
|     return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True) |
| |
| # Per-base signal: profile probabilities scaled by the predicted total counts. |
| predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"]) - 1) |
| ``` |
|
|
| ## Docker image to load and use the models |
| https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1) |
|
|
| ## Code for ChromBPNet |
| - https://github.com/kundajelab/chrombpnet/ |
|
|
| # License & citation |
| External data users may freely download, analyze and publish results based on any ENCODE data without restrictions. |
|
|
| Released under the [ENCODE data-use policy](https://www.encodeproject.org/about/data-use-policy/). Please cite the ENCODE Project Consortium and the model software: [ChromBPNet](https://github.com/kundajelab/chrombpnet) (Pampari et al., bioRxiv 2024). |