PRO-cap Atlas BPNet Models

This repository contains BPNet models trained on the ENCODE PRO-cap atlas for sequence-based prediction of strand-specific transcription initiation signal on the human GRCh38/hg38 genome. The collection includes 224 experiment-specific models spanning 126 unique biosamples, with one model directory per ENCODE experiment accession and seven chromosome-held-out cross-validation folds per experiment.

Each model takes a 2,114 bp one-hot-encoded DNA sequence window as input and predicts a 1,000 bp PRO-cap initiation profile for the plus and minus strands, along with corresponding count predictions.
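As a rough illustration of the input and output geometry described above, the sketch below one-hot encodes a DNA string and computes how much input context flanks the 1,000 bp output window. The A, C, G, T channel order, the all-zero encoding of N, and the assumption that the output window is centered in the input are illustrative choices, not confirmed conventions of these models. The helper names are hypothetical and are not part of bpnet-lite.

```python
INPUT_LEN = 2114   # input window width stated in this card
OUTPUT_LEN = 1000  # predicted profile width stated in this card
ALPHABET = "ACGT"  # assumed channel order

def one_hot_encode(seq):
    """Return a 4 x len(seq) nested list; non-ACGT characters (e.g. N) stay all-zero."""
    encoded = [[0] * len(seq) for _ in ALPHABET]
    for pos, base in enumerate(seq.upper()):
        channel = ALPHABET.find(base)
        if channel >= 0:
            encoded[channel][pos] = 1
    return encoded

def flank_width(input_len=INPUT_LEN, output_len=OUTPUT_LEN):
    """Bases of input context on each side of an (assumed) centered output window."""
    return (input_len - output_len) // 2

print(one_hot_encode("ACGTN"))  # four rows, one per base in ALPHABET
print(flank_width())            # (2114 - 1000) // 2 = 557
```

Under the centering assumption, each predicted position therefore sees at least 557 bp of sequence context on either side.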

Model Details

Developed by: Adam Y. He and Anshul Kundaje
Model type: BPNet
Library: bpnet-lite
Assay: PRO-cap
Organism: Homo sapiens
Genome assembly: GRCh38/hg38
License: MIT
Collection: https://huggingface.co/collections/adamyhe/procap-atlas
Model repo: https://huggingface.co/adamyhe/procap-atlas-bpnet/
Atlas metadata: https://huggingface.co/datasets/adamyhe/procap-atlas-metadata
Companion track dataset: https://huggingface.co/datasets/adamyhe/procap-atlas-tracks
Code repository: https://github.com/kundajelab/procap-atlas
UCSC track hub: https://huggingface.co/datasets/adamyhe/procap-atlas-tracks/resolve/main/ucsc/hub.txt

Intended Use

These models are intended for research use in regulatory genomics, especially for analyzing sequence determinants of transcription start site activity, promoter-proximal initiation, and PRO-cap signal shape across ENCODE biosamples.

The models can also be used as components in downstream research workflows for variant effect exploration, motif analysis, or comparison of initiation programs across cell types and tissues. Downstream users should validate performance for their own loci, biosamples, and analysis settings.

Limitations

The models are trained for human GRCh38/hg38 sequence and should not be assumed to transfer to other species.
Predictions are PRO-cap-like initiation signals near trained peak loci, not general-purpose gene expression estimates. Background genomic sequence and fully synthetic sequences are likely to be out-of-distribution (OOD) and can produce unreliable predictions. Centromeres, LTRs, and other repetitive sequences are known to be problematic.
ENCODE biosample coverage is uneven, and model behavior may vary across cell types, tissues, coverage levels, and assay quality. See dataset metadata and quality flags in the metadata repo linked above.
Performance varies by experiment and read depth; inspect per-experiment benchmark results before using a model for downstream analysis.
These models are research artifacts and should not be used for clinical or diagnostic decision-making.

Training Details

Training inputs come from ENCODE PRO-cap plus- and minus-strand signal tracks and processed peak sets generated by the companion procap-atlas preprocessing pipeline. Processed training data will be made available at adamyhe/procap-atlas-tracks.

Each experiment is trained across seven chromosome folds defined in configs/chrom_splits.yaml. For model fold i, data fold i is held out for testing, data fold (i + 1) % 7 is used for validation, and the remaining data folds are used for training.
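The fold rotation described above can be sketched as follows. This is a minimal illustration of the stated rule only; the function name is hypothetical, and the mapping from data fold index to chromosomes lives in configs/chrom_splits.yaml, which is not reproduced here.

```python
N_FOLDS = 7  # number of chromosome folds stated in this card

def fold_roles(model_fold, n_folds=N_FOLDS):
    """Return which data folds serve as test, validation, and training for one model fold."""
    test = model_fold
    val = (model_fold + 1) % n_folds
    train = [f for f in range(n_folds) if f not in (test, val)]
    return {"test": test, "val": val, "train": train}

for i in range(N_FOLDS):
    print(i, fold_roles(i))
```

For example, model fold 6 tests on data fold 6, validates on data fold 0 (the index wraps around), and trains on data folds 1 through 5.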

Default training uses GC-matched negatives at a ratio of one negative per seven positive examples (1/7). Training also uses reverse-complement augmentation, up to 200 bp of sequence jitter, AdamW optimization, batch size 64, learning rate 5e-4, 512 filters, 8 BPNet layers, and a count loss weight of 100.
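To make the two augmentations concrete, here is a minimal sketch of what they imply for strand-specific data. It is not the training code from bpnet-lite or the companion repository: the helper names are hypothetical, and the symmetric uniform jitter in [-200, 200] is an assumption about what "up to 200 bp" means. The strand swap follows from the data being strand-specific: reverse-complementing a sequence turns plus-strand signal into minus-strand signal and reverses position order.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
MAX_JITTER = 200  # "up to 200 bp" per this card

def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def rc_stranded_signal(plus, minus):
    """Swap strands and reverse positions to match a reverse-complemented sequence."""
    return minus[::-1], plus[::-1]

def jittered_start(center, window_len, rng, max_jitter=MAX_JITTER):
    """Window start after shifting the center by a uniform offset (assumed symmetric)."""
    offset = rng.randint(-max_jitter, max_jitter)
    return center + offset - window_len // 2

rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(reverse_complement("AACGT"))
print(rc_stranded_signal([1, 2, 3], [0, 0, 9]))
print(jittered_start(10_000, 2114, rng))
```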

Evaluation

Models are evaluated on held-out test chromosomes for each fold. Metrics include profile Pearson correlation, profile Jensen-Shannon distance, log-counts Pearson correlation, and counts Spearman correlation.
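For readers unfamiliar with these metrics, the sketch below gives plain-Python definitions of Pearson correlation and Jensen-Shannon distance between an observed and a predicted profile. It is illustrative only and is not the project's benchmark code: the function names are hypothetical, and the use of base-2 logarithms (which bounds the distance in [0, 1]) is an assumption that may differ from the benchmark workflow.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs assumed).

    Inputs are non-negative profiles with nonzero sums; they are normalized here.
    """
    sum_p, sum_q = sum(p), sum(q)
    p = [v / sum_p for v in p]
    q = [v / sum_q for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

observed = [0, 5, 20, 5, 0]
predicted = [1, 4, 18, 6, 1]
print(pearson(observed, predicted))
print(js_distance(observed, predicted))
```

Note the opposite directions of the two profile metrics: higher Pearson correlation is better, while lower Jensen-Shannon distance is better.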

Across the 224 experiment benchmark files currently present in the project, median genome-wide performance is:

Metric	Median
Profile Pearson correlation	0.490
Profile Jensen-Shannon distance	0.357
Log-counts Pearson correlation	0.687
Counts Spearman correlation	0.701

Exact per-experiment results are written by the project benchmark workflow to performance_metrics/bpnet/{experiment}.json. Model performance depends heavily on PRO-cap library size, quality, and construction details, so consult the precomputed performance metrics and the atlas metadata to determine whether a model is suitable for your use case.
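A per-experiment metrics file can be inspected with a few lines of standard-library Python. The sketch below assumes only the path pattern given above and that it is run from a checkout of the companion repository; the helper names are hypothetical, and nothing is assumed about the keys inside the JSON, so inspect the loaded object to see what it contains.

```python
import json
from pathlib import Path

def metrics_path(experiment, root="performance_metrics/bpnet"):
    """Build the path to one experiment's benchmark JSON."""
    return Path(root) / f"{experiment}.json"

def load_metrics(experiment, root="performance_metrics/bpnet"):
    """Load one experiment's benchmark JSON as a Python object."""
    with metrics_path(experiment, root).open() as handle:
        return json.load(handle)

path = metrics_path("ENCSR882DWM")
if path.exists():
    metrics = load_metrics("ENCSR882DWM")
    print(type(metrics).__name__)  # inspect the structure before relying on any key
else:
    print(f"{path} not found; run this from a procap-atlas checkout with benchmark outputs")
```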

Interpretability

The project includes workflows for DeepLIFT/SHAP-style BPNet attributions, observed-nucleotide attribution BigWig conversion, and TF-MoDISco motif discovery. Attribution outputs are generated per experiment and prediction head and can be viewed alongside PRO-cap signal tracks in genome browsers.

See the attribution and MoDISco workflows in the src/bpnet directory for details.

How to Use

Download model artifacts from this Hugging Face model repository. Models are organized by ENCODE experiment accession, and each experiment contains seven fold checkpoints named by fold index.

Individual model checkpoints are saved as bpnetlite.bpnet.BPNet modules, so they can be loaded directly with torch.load, provided the bpnet-lite package is installed:

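import torch

# The checkpoint is a pickled BPNet module rather than a bare state dict,
# so weights_only=False is needed and bpnetlite must be importable.
model = torch.load("/path/to/model/ENCSR882DWM.fold0.torch", weights_only=False)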
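Because each experiment has seven fold checkpoints, it can be convenient to enumerate them together. The sketch below builds the expected file paths for one experiment. The file name pattern is inferred from the single example above, and the {root}/{accession}/ directory layout is an assumption based on the description of one directory per accession; the helper name is hypothetical, so verify both against the files actually present in the repository.

```python
from pathlib import Path

N_FOLDS = 7  # seven fold checkpoints per experiment, per this card

def checkpoint_paths(root, accession, n_folds=N_FOLDS):
    """List assumed checkpoint paths: {root}/{accession}/{accession}.fold{i}.torch."""
    experiment_dir = Path(root) / accession
    return [experiment_dir / f"{accession}.fold{i}.torch" for i in range(n_folds)]

for path in checkpoint_paths("/path/to/model", "ENCSR882DWM"):
    print(path)
```

Each resulting path can then be passed to torch.load as shown above.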

The companion GitHub repository contains scripts for preprocessing, inference, benchmarking, attribution, and motif discovery:

git clone https://github.com/kundajelab/procap-atlas.git
cd procap-atlas
mamba env create -f environment.yml
mamba activate procap-atlas

python src/bpnet/benchmark/benchmark_bpnet.py -e ENCSR882DWM
python src/bpnet/attribute/attribute_bpnet.py -e ENCSR882DWM

See the repository README and src/bpnet/README.md for additional details.

Citation

If you use these models, please cite the PRO-cap atlas model repository and the underlying software dependencies used for training and interpretation, including bpnet-lite, tangermeme, and tfmodisco.

Contact

For questions, bug reports, or reuse notes, please use the GitHub repository issues: https://github.com/kundajelab/procap-atlas/issues
