---
license: mit
library_name: chrombpnet
tags:
  - encode
  - chrombpnet
  - chromatin-accessibility
  - DNASE
  - t-cell
  - hg38
---
# ENCODE ChromBPNet Atlas
As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq datasets across 408 biosamples. Here, we provide all models for open-source use.

For more information about the models, see:
- Main ENCODE 4 Paper
- [A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays](https://doi.org/10.5281/zenodo.17123347)
- [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants](https://doi.org/10.1101/2024.12.25.630221)

## ChromBPNet model: DNASE in activated CD4-positive, alpha-beta T cell (ENCSR679EFH)
- Model: ChromBPNet
- Assay: DNASE-seq
- Experiment: [ENCSR679EFH](https://www.encodeproject.org/experiments/ENCSR679EFH/)
- Model annotation: [ENCSR434QWM](https://www.encodeproject.org/annotations/ENCSR434QWM/)
- Biosample: activated CD4-positive, alpha-beta T cell (Homo sapiens activated CD4-positive, alpha-beta T cell female adult (33 years) treated with anti-CD3 and anti-CD28 coated beads for 24 hours)
- Cell slim(s): T-cell,leukocyte,hematopoietic-cell,CD4+-T-cell
- Organ slim(s): bodily-fluid,blood
- Developmental slim(s): mesoderm,endoderm
- System slim(s): immune-system
- Assembly: hg38

## Directory structure
- `fold_0`: models for cross-validation fold 0
    - `model.chrombpnet.fold_0.encid.h5`: full ChromBPNet model, combining the bias model and the bias-corrected model, in .h5 format
    - `model.chrombpnet_nobias.fold_0.encid.h5`: bias-corrected accessibility model in .h5 format (use for all biological discovery)
    - `model.bias_scaled.fold_0.encid.h5`: bias model in .h5 format
    - `model.chrombpnet.fold_0.encid.tar`: full ChromBPNet model, combining the bias model and the bias-corrected model, in SavedModel format. Untarring it produces a directory named "chrombpnet".
    - `model.chrombpnet_nobias.fold_0.encid.tar`: bias-corrected accessibility model in SavedModel format (use for all biological discovery). Untarring it produces a directory named "chrombpnet_wo_bias".
    - `model.bias_scaled.fold_0.encid.tar`: bias model in SavedModel format. Untarring it produces a directory named "bias_model_scaled".
    - `logs.models.fold_0.encid`: folder containing the training log files
- `fold_1`: models for cross-validation fold 1
- `fold_2`: models for cross-validation fold 2
- `fold_3`: models for cross-validation fold 3
- `fold_4`: models for cross-validation fold 4
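
The layout above can be turned into a list of per-fold file paths. The sketch below rests on two assumptions that the listing does not confirm: that `encid` in the file names is a placeholder for an ENCODE identifier, and that `fold_1` to `fold_4` contain the same files as `fold_0`. The helper `fold_model_paths` and its parameters are hypothetical.

```python
from pathlib import Path

def fold_model_paths(root, encid, variant="chrombpnet_nobias", ext="h5", n_folds=5):
    """Hypothetical helper: expected path of one model file in each fold directory."""
    root = Path(root)
    # Assumption: every fold directory follows the naming pattern shown for fold_0
    return [
        root / f"fold_{fold}" / f"model.{variant}.fold_{fold}.{encid}.{ext}"
        for fold in range(n_folds)
    ]

# "encid" is kept as the literal placeholder from the listing above
for path in fold_model_paths(".", encid="encid"):
    print(path)
```

Passing `variant="chrombpnet"` or `variant="bias_scaled"`, or `ext="tar"`, gives the other file names in the listing.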

# Instructions
## (1) Pseudocode for loading models in .h5 format 

(1) Run the code below in Python after defining `model_in_h5_format` (the path to a model file in .h5 format) and `inputs`.

(2) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number of sequences, 2114 is the input sequence length, and 4 corresponds to [A, C, G, T].
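
One way to build an `inputs` array of this shape from DNA strings is sketched here. The helper `one_hot_encode` is hypothetical and not part of ChromBPNet, and leaving non-ACGT symbols such as N as all-zero rows is an assumption rather than documented behavior.

```python
import numpy as np

INPUT_LEN = 2114  # input sequence length stated above

def one_hot_encode(seqs, input_len=INPUT_LEN):
    """Hypothetical helper: encode DNA strings as an (N, input_len, 4) float32 array."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}  # channel order [A, C, G, T]
    out = np.zeros((len(seqs), input_len, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != input_len:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {input_len}")
        for j, base in enumerate(seq.upper()):
            k = index.get(base)
            if k is not None:  # assumption: other symbols (e.g. N) stay all-zero
                out[i, j, k] = 1.0
    return out

inputs = one_hot_encode(["ACGT" * 528 + "AC"])  # 4 * 528 + 2 = 2114 bases
print(inputs.shape)  # (1, 2114, 4)
```

The resulting array can be passed as `inputs` to the loading code.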

```python
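import tensorflow as tf
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.models import load_model

# Make `tf` available as a custom object when the model is deserialized
custom_objects = {"tf": tf}
get_custom_objects().update(custom_objects)

# `model_in_h5_format`: path to a .h5 model file; `inputs`: array of shape (N, 2114, 4)
model = load_model(model_in_h5_format, compile=False)
outputs = model(inputs)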
```

`outputs` is a list of two elements. The first has shape (N, 1000) and contains the profile logits for the 1000-base-pair output. The second has shape (N, 1) and contains the logcount predictions. To convert these predictions into per-base signals, use the code below.

```python
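import numpy as np

def softmax(x, temp=1):
    # Center the logits per sequence before exponentiating; the softmax value is unchanged
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Per-base profile probabilities scaled by exp(logcount) - 1
predictions = softmax(outputs[0]) * (np.exp(outputs[1]) - 1)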
```
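
To see what this conversion produces without loading a model, the sketch below applies the same formula to synthetic NumPy arrays. The random `logits` and the `logcounts` values are made-up stand-ins for `outputs[0]` and `outputs[1]`, not real predictions. Under this formula, each row of `predictions` sums to `exp(logcount) - 1`.

```python
import numpy as np

def softmax(x, temp=1):
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 1000))    # synthetic stand-in for outputs[0], shape (N, 1000)
logcounts = np.array([[3.0], [5.0]])   # synthetic stand-in for outputs[1], shape (N, 1)

predictions = softmax(logits) * (np.exp(logcounts) - 1)

# The softmax rows sum to 1, so each prediction row sums to exp(logcount) - 1
row_totals = predictions.sum(axis=1, keepdims=True)
print(predictions.shape)
print(np.allclose(row_totals, np.exp(logcounts) - 1))
```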

## (2) Pseudocode for loading models in .tar format

(1) First, untar the archive: `tar -xvf model.tar`

(2) Run the code below in Python after defining `model_dir_untared` (the directory produced by untarring) and `inputs`.

(3) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number of sequences, 2114 is the input sequence length, and 4 corresponds to [A, C, G, T].
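
The untar step in (1) can also be done from Python with the standard-library `tarfile` module. This is a minimal sketch. The helper `untar_model` is hypothetical, and it assumes the archive comes from a source you trust.

```python
import tarfile
from pathlib import Path

def untar_model(tar_path, out_dir="."):
    """Hypothetical helper: extract a model archive and return its top-level entries."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
        tar.extractall(out_dir)  # only extract archives from a trusted source
    # Collect the distinct top-level names, e.g. the SavedModel directory
    top_level = sorted({Path(n).parts[0] for n in names if Path(n).parts})
    return [out_dir / name for name in top_level]
```

For an archive that unpacks into a single directory, `model_dir_untared = str(untar_model("model.tar")[0])` gives the path to pass to the loading code.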

Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load

```python
import tensorflow as tf

# `model_dir_untared`: path to the untarred SavedModel directory
model = tf.saved_model.load(model_dir_untared)
outputs = model.signatures['serving_default'](sequence=inputs.astype('float32'))
```

`outputs` is a dictionary with two entries. The key `logits_profile_predictions` holds a value of shape (N, 1000), the profile logits for the 1000-base-pair output. The key `logcount_predictions` holds a value of shape (N, 1), the logcount predictions. To convert these predictions into per-base signals, use the code below.

```python
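import numpy as np

def softmax(x, temp=1):
    # Center the logits per sequence before exponentiating; the softmax value is unchanged
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

# Per-base profile probabilities scaled by exp(logcount) - 1
predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"]) - 1)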
```

## Docker image to load and use the models
https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)

## Code for ChromBPNet
- https://github.com/kundajelab/chrombpnet/

# License & citation
External data users may freely download, analyze and publish results based on any ENCODE data without restrictions.

Released under the [ENCODE data-use policy](https://www.encodeproject.org/about/data-use-policy/). Please cite the ENCODE Project Consortium and the model software: [ChromBPNet](https://github.com/kundajelab/chrombpnet) (Pampari et al., bioRxiv 2024).