	---
	license: mit
	library_name: chrombpnet
	tags:
	- encode
	- chrombpnet
	- chromatin-accessibility
	- DNASE
	- t-cell
	- hg38
	---
	# ENCODE ChromBPNet Atlas
	As part of the ENCODE 4 Project, we trained ChromBPNet models on 1,512 ENCODE DNase-seq and ATAC-seq experiments across 408 biosamples. Here, we provide all models for open-source use.

	For more information about the models, see:
	- Main ENCODE 4 Paper
	- [A unified lexicon of predictive DNA sequence motifs from ENCODE transcription factor binding and chromatin accessibility assays](https://doi.org/10.5281/zenodo.17123347)
	- [ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants](https://doi.org/10.1101/2024.12.25.630221)

	## ChromBPNet model: DNASE in activated CD4-positive, alpha-beta T cell (ENCSR679EFH)
	- Model: ChromBPNet
	- Assay: DNASE-seq
	- Experiment: [ENCSR679EFH](https://www.encodeproject.org/experiments/ENCSR679EFH/)
	- Model annotation: [ENCSR434QWM](https://www.encodeproject.org/annotations/ENCSR434QWM/)
	- Biosample: activated CD4-positive, alpha-beta T cell (Homo sapiens activated CD4-positive, alpha-beta T cell female adult (33 years) treated with anti-CD3 and anti-CD28 coated beads for 24 hours)
	- Cell slim(s): T-cell,leukocyte,hematopoietic-cell,CD4+-T-cell
	- Organ slim(s): bodily-fluid,blood
	- Developmental slim(s): mesoderm,endoderm
	- System slim(s): immune-system
	- Assembly: hg38

	## Directory structure
	- `fold_0`: models for cross-validation fold 0
	  - `model.chrombpnet.fold_0.encid.h5`: full ChromBPNet model that combines the bias model and the bias-corrected model, in .h5 format
	  - `model.chrombpnet_nobias.fold_0.encid.h5`: bias-corrected accessibility model in .h5 format (use for all biological discovery)
	  - `model.bias_scaled.fold_0.encid.h5`: bias model in .h5 format
	  - `model.chrombpnet.fold_0.encid.tar`: full ChromBPNet model that combines the bias model and the bias-corrected model, in SavedModel format. Untarring it produces a directory named "chrombpnet".
	  - `model.chrombpnet_nobias.fold_0.encid.tar`: bias-corrected accessibility model in SavedModel format (use for all biological discovery). Untarring it produces a directory named "chrombpnet_wo_bias".
	  - `model.bias_scaled.fold_0.encid.tar`: bias model in SavedModel format. Untarring it produces a directory named "bias_model_scaled".
	  - `logs.models.fold_0.encid`: folder containing the training log files
	- `fold_1`: models for cross-validation fold 1
	- `fold_2`: models for cross-validation fold 2
	- `fold_3`: models for cross-validation fold 3
	- `fold_4`: models for cross-validation fold 4
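The naming scheme above can be used to locate a model file without hard-coding the `encid` part of its name. The following is a minimal sketch, not part of the ChromBPNet tooling: the helper `find_fold_models` and its parameters are hypothetical, and it assumes that each fold's files sit directly inside the matching `fold_N` directory of a local copy of this repository.

```python
from pathlib import Path

def find_fold_models(repo_dir, fold, kind="chrombpnet_nobias", ext="h5"):
    """Return files named model.<kind>.fold_<fold>.<encid>.<ext> inside fold_<fold>/.

    Hypothetical helper: the <encid> part is matched with a wildcard because
    its concrete value is not spelled out in the directory listing.
    """
    fold_dir = Path(repo_dir) / f"fold_{fold}"
    pattern = f"model.{kind}.fold_{fold}.*.{ext}"
    # glob returns nothing (rather than raising) if the directory is missing
    return sorted(fold_dir.glob(pattern))
```

For example, `find_fold_models("path/to/repo", 0)` would list the bias-corrected .h5 model for fold 0, while `kind="bias_scaled"` or `ext="tar"` selects the other files described above.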

	# Instructions
	## (1) Pseudocode for loading models in .h5 format

	(1) Run the loading code in Python after defining `model_in_h5_format` (the path to the .h5 file) and `inputs`.
	(2) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the
	number of sequences, 2114 is the input sequence length, and 4 corresponds to the bases [A,C,G,T].
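One way to build `inputs` from raw DNA strings is sketched below. This is an illustrative helper, not code from the ChromBPNet package: the name `one_hot_encode` is hypothetical, and encoding any character other than A, C, G, or T (for example `N`) as an all-zero row is an assumption rather than documented model behavior.

```python
import numpy as np

BASES = "ACGT"  # channel order stated in this model card

def one_hot_encode(seqs, length=2114):
    """Encode DNA strings as a float32 array of shape (N, length, 4)."""
    # Map ASCII codes to channel indices; -1 marks characters outside ACGT.
    lookup = np.full(256, -1, dtype=np.int64)
    for i, base in enumerate(BASES):
        lookup[ord(base)] = i
        lookup[ord(base.lower())] = i

    out = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for n, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {n} has length {len(seq)}, expected {length}")
        idx = lookup[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]
        known = idx >= 0
        # Assumption: unknown bases stay as all-zero rows.
        out[n, np.nonzero(known)[0], idx[known]] = 1.0
    return out
```

With two 2114 bp strings `seq_a` and `seq_b`, `inputs = one_hot_encode([seq_a, seq_b])` yields an array of shape (2, 2114, 4).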

	```python
	import tensorflow as tf
	from tensorflow.keras.utils import get_custom_objects
	from tensorflow.keras.models import load_model

	custom_objects={"tf": tf}
	get_custom_objects().update(custom_objects)

	model=load_model(model_in_h5_format,compile=False)
	outputs = model(inputs)
	```

	The list `outputs` consists of two elements. The first element has shape (N, 1000) and
	contains logit predictions for the 1000-base-pair output window. The second element has shape
	(N, 1) and contains logcount predictions. To transform these predictions into per-base signals,
	use the code below.

	```python
	import numpy as np

	def softmax(x, temp=1):
	    # Center the logits per sequence before exponentiating
	    norm_x = x - np.mean(x, axis=1, keepdims=True)
	    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

	predictions = softmax(outputs[0]) * (np.exp(outputs[1])-1)
	```
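The conversion can be checked without loading a model. The sketch below uses random arrays as stand-ins for the two model outputs (the names `logits` and `logcounts` are illustrative) and confirms that each predicted profile sums to `exp(logcount) - 1`, since the softmax only distributes that total across the 1000 positions.

```python
import numpy as np

def softmax(x, temp=1):
    # Center the logits per sequence before exponentiating
    norm_x = x - np.mean(x, axis=1, keepdims=True)
    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 1000))         # stand-in for outputs[0]
logcounts = rng.uniform(2, 8, size=(3, 1))  # stand-in for outputs[1]

profile = softmax(logits)                        # each row sums to 1
predictions = profile * (np.exp(logcounts) - 1)  # per-base predicted signal

assert predictions.shape == (3, 1000)
assert np.allclose(profile.sum(axis=1), 1.0)
assert np.allclose(predictions.sum(axis=1, keepdims=True), np.exp(logcounts) - 1)
```

The same check applies to real outputs after converting them to NumPy arrays.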

	## (2) Pseudocode for loading models in .tar format

	(1) First untar the archive: `tar -xvf model.tar`
	(2) Run the loading code in Python after defining `model_dir_untared` (the path to the extracted directory) and `inputs`.
	(3) `inputs` is a one-hot encoded sequence array of shape (N, 2114, 4), where N is the number
	of sequences, 2114 is the input sequence length, and 4 corresponds to the bases [A,C,G,T].
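If running `tar` from a shell is inconvenient, the extraction step can also be done from Python with the standard-library `tarfile` module. This is a minimal sketch: the helper name `untar_model` is hypothetical, and it should only be used on archives from a trusted source, since it extracts every member as-is.

```python
import tarfile
from pathlib import Path

def untar_model(tar_path, dest_dir="."):
    """Extract a model .tar archive and return the top-level paths it created."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        top_level = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:  # skip a bare "." entry, which has no path components
                top_level.add(parts[0])
        tar.extractall(dest)
    return [dest / name for name in sorted(top_level)]
```

The returned path (for example the "chrombpnet_wo_bias" directory for the bias-corrected model) is what `model_dir_untared` should point to.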

	Reference: https://www.tensorflow.org/api_docs/python/tf/saved_model/load

	```python
	import tensorflow as tf

	# model_dir_untared is the path to the extracted SavedModel directory
	model = tf.saved_model.load(model_dir_untared)
	outputs = model.signatures['serving_default'](**{'sequence':inputs.astype('float32')})
	```

	The variable `outputs` is a dictionary containing two key-value pairs. The first key,
	`logits_profile_predictions`, holds a value of shape (N, 1000) containing logit predictions
	for the 1000-base-pair output window. The second key, `logcount_predictions`,
	holds a value of shape (N, 1) containing logcount predictions. To transform these
	predictions into per-base signals, use the code below.

	```python
	import numpy as np
	def softmax(x, temp=1):
	    # Center the logits per sequence before exponentiating
	    norm_x = x - np.mean(x, axis=1, keepdims=True)
	    return np.exp(temp * norm_x) / np.sum(np.exp(temp * norm_x), axis=1, keepdims=True)

	predictions = softmax(outputs["logits_profile_predictions"]) * (np.exp(outputs["logcount_predictions"])-1)
	```

	## Docker image to load and use the models
	https://hub.docker.com/r/kundajelab/chrombpnet-atlas/ (tag:v1)

	## Code for ChromBPNet
	- https://github.com/kundajelab/chrombpnet/

	# License & citation
	External data users may freely download, analyze and publish results based on any ENCODE data without restrictions.

	Released under the [ENCODE data-use policy](https://www.encodeproject.org/about/data-use-policy/). Please cite the ENCODE Project Consortium and the model software: [ChromBPNet](https://github.com/kundajelab/chrombpnet) (Pampari et al., bioRxiv 2024).