---
license: mit
library_name: chrombpnet
tags:
- encode
- chrombpnet
- chromatin-accessibility
- DNASE
- t-cell
- hg38
---
# ENCODE ChromBPNet Atlas
As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq experiments across 408 biosamples. Here, we provide all models for open-source use.
For more information about the models, see:
- Main ENCODE 4 Paper
- [A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays](https://doi.org/10.5281/zenodo.17123347)
- [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants](https://doi.org/10.1101/2024.12.25.630221)
## ChromBPNet model: DNASE in activated CD4-positive, alpha-beta T cell (ENCSR679EFH)
- Model: ChromBPNet
- Assay: DNASE-seq
- Experiment: [ENCSR679EFH](https://www.encodeproject.org/experiments/ENCSR679EFH/)
- Model annotation: [ENCSR434QWM](https://www.encodeproject.org/annotations/ENCSR434QWM/)
- Biosample: activated CD4-positive, alpha-beta T cell (Homo sapiens activated CD4-positive, alpha-beta T cell female adult (33 years) treated with anti-CD3 and anti-CD28 coated beads for 24 hours)
- Cell slim(s): T-cell,leukocyte,hematopoietic-cell,CD4+-T-cell
- Organ slim(s): bodily-fluid,blood
- Developmental slim(s): mesoderm,endoderm
- System slim(s): immune-system
- Assembly: hg38
## Directory structure
- `fold_0`: models for cross-validation fold 0
  - `model.chrombpnet.fold_0.encid.h5`: full ChromBPNet model that combines the bias model and the bias-corrected model, in .h5 format
  - `model.chrombpnet_nobias.fold_0.encid.h5`: bias-corrected accessibility model in .h5 format (use for all biological discovery)
  - `model.bias_scaled.fold_0.encid.h5`: bias model in .h5 format
  - `model.chrombpnet.fold_0.encid.tar`: full ChromBPNet model that combines the bias model and the bias-corrected model, in SavedModel format. Untarring it produces a directory named "chrombpnet".
  - `model.chrombpnet_nobias.fold_0.encid.tar`: bias-corrected accessibility model in SavedModel format (use for all biological discovery). Untarring it produces a directory named "chrombpnet_wo_bias".
  - `model.bias_scaled.fold_0.encid.tar`: bias model in SavedModel format. Untarring it produces a directory named "bias_model_scaled".
  - `logs.models.fold_0.encid`: folder containing log files from model training
- `fold_1`: models for cross-validation fold 1
- `fold_2`: models for cross-validation fold 2
- `fold_3`: models for cross-validation fold 3
- `fold_4`: models for cross-validation fold 4
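
The naming pattern above can be enumerated programmatically, for example to check a local download. This is a minimal sketch under assumptions: `fold_files`, `MODEL_KINDS`, and the `models` root directory are hypothetical names, `encid` is kept as the literal placeholder used in the listing, and folds 1 to 4 are assumed to mirror the layout shown for `fold_0`.

```python
from pathlib import Path

# Model variants as they appear in the file names listed above.
MODEL_KINDS = ("chrombpnet", "chrombpnet_nobias", "bias_scaled")

def fold_files(root, fold, encid="encid", ext="h5"):
    """Return the expected model paths for one fold (hypothetical helper)."""
    fold_dir = Path(root) / f"fold_{fold}"
    return [fold_dir / f"model.{kind}.fold_{fold}.{encid}.{ext}" for kind in MODEL_KINDS]

# Assumption: five folds (0-4), each laid out like fold_0.
for fold in range(5):
    for path in fold_files("models", fold):
        print(path)
```

Passing `ext="tar"` yields the corresponding SavedModel archive names instead.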
# Instructions
## (1) Pseudocode for loading models in .h5 format
(1) Use the code below in Python after defining `model_in_h5_format` (the path to the .h5 file) and `inputs`.
(2) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4). Here N is the
number of sequences tested, 2114 is the input sequence length, and 4 corresponds to [A,C,G,T].
```python
import tensorflow as tf
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.models import load_model
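# Register `tf` as a custom object so layers that reference it can be deserialized.
custom_objects = {"tf": tf}
get_custom_objects().update(custom_objects)
# compile=False skips restoring the training configuration; the model is used for inference only.
model = load_model(model_in_h5_format, compile=False)
outputs = model(inputs)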
```
`outputs` is a list of two elements. The first has shape (N, 1000) and contains profile
logits over the 1000-base-pair output window. The second has shape (N, 1) and contains
logcount predictions. To convert these predictions into per-base signals, use the code
below.
```python
import numpy as np
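def softmax(x, temp=1):
    # Center each row before exponentiating; softmax is invariant to this shift.
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Per-base signal: profile probabilities scaled by the predicted total counts.
predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1)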
```
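
The `inputs` array described above can be built from raw DNA strings. The following is a minimal sketch, not part of the ChromBPNet package: `one_hot_encode` and `BASES` are illustrative names, and leaving non-ACGT characters (such as `N`) as all-zero rows is an assumption rather than documented behavior.

```python
import numpy as np

# Column order follows the [A, C, G, T] convention stated above.
BASES = "ACGT"

def one_hot_encode(seqs, length=2114):
    """Encode equal-length DNA strings as a float32 array of shape (N, length, 4)."""
    encoded = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for n, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {n} has length {len(seq)}, expected {length}")
        for i, base in enumerate(seq.upper()):
            j = BASES.find(base)
            if j >= 0:  # Assumption: any other character (e.g. N) stays all-zero.
                encoded[n, i, j] = 1.0
    return encoded

# Toy sequence of the required length: 4 * 528 + 2 = 2114 bases.
inputs = one_hot_encode(["ACGT" * 528 + "AC"])
print(inputs.shape)  # (1, 2114, 4)
```

The resulting array can be passed as `inputs` to either loading path in this card.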
## (2) Pseudocode for loading models in .tar format
(1) First untar the archive: `tar -xvf model.tar`
(2) Use the code below in Python after defining `model_dir_untared` (the path to the untarred directory) and `inputs`.
(3) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4). Here N is the number
of sequences tested, 2114 is the input sequence length, and 4 corresponds to [A,C,G,T].
Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load
```python
import tensorflow as tf
# Pass the path variable, not the literal string 'model_dir_untared'.
model = tf.saved_model.load(model_dir_untared)
outputs = model.signatures['serving_default'](**{'sequence': inputs.astype('float32')})
```
`outputs` is a dictionary with two key-value pairs. The first key,
`logits_profile_predictions`, holds a value of shape (N, 1000) containing profile logits
over the 1000-base-pair output window. The second key, `logcount_predictions`,
holds a value of shape (N, 1) containing logcount predictions. To convert these
predictions into per-base signals, use the code below.
```python
import numpy as np
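def softmax(x, temp=1):
    # Center each row before exponentiating; softmax is invariant to this shift.
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Per-base signal: profile probabilities scaled by the predicted total counts.
predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"]) - 1)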
```
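
To see what the transformation above does, it can help to run it on synthetic arrays. The sketch below uses random stand-ins for the model outputs (not real predictions) and checks that each row of `predictions` sums to `exp(logcount) - 1`, so the softmax distributes a predicted total across the 1000 positions. The `- 1` is consistent with the logcounts being on a log(1 + counts) scale, though this card does not state that explicitly. Subtracting the row maximum instead of the mean is a substitution made here for numerical stability; it gives the same result.

```python
import numpy as np

def softmax(x):
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 1000))    # synthetic stand-in for the profile logits
logcounts = np.array([[3.0], [5.5]])   # synthetic stand-in for the logcount predictions

predictions = softmax(logits) * (np.exp(logcounts) - 1)

# Each row sums to exp(logcount) - 1, the predicted total for that sequence.
assert np.allclose(predictions.sum(axis=1, keepdims=True), np.exp(logcounts) - 1)
print(predictions.shape)  # (2, 1000)
```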
## Docker image to load and use the models
https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)
## Code for ChromBPNet
- https://github.com/kundajelab/chrombpnet/
# License & citation
External data users may freely download, analyze and publish results based on any ENCODE data without restrictions.
Released under the [ENCODE data-use policy](https://www.encodeproject.org/about/data-use-policy/). Please cite the ENCODE Project Consortium and the model software: [ChromBPNet](https://github.com/kundajelab/chrombpnet) (Pampari et al., bioRxiv 2024).