PRO-cap Atlas BPNet Models
This repository contains BPNet models trained on the ENCODE PRO-cap atlas for sequence-based prediction of strand-specific transcription initiation signal on the human GRCh38/hg38 genome. The collection includes 224 experiment-specific models spanning 126 unique biosamples, with one model directory per ENCODE experiment accession and seven chromosome-held-out cross-validation folds per experiment.
Each model takes a 2,114 bp one-hot-encoded DNA sequence window as input and predicts a 1,000 bp PRO-cap initiation profile for the plus and minus strands, along with corresponding count predictions.
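As a rough illustration of the input and output geometry described above, the sketch below derives the flank implied by a 2,114 bp input and a 1,000 bp output, and one-hot encodes a sequence. The A, C, G, T channel order, the 4-by-length layout, the all-zero encoding of N, and the centering of the output window inside the input window are assumptions that are common in BPNet tooling but are not stated in this card. The helper names are hypothetical.

```python
INPUT_LEN = 2114   # input window width stated in this card
OUTPUT_LEN = 1000  # predicted profile width stated in this card
FLANK = (INPUT_LEN - OUTPUT_LEN) // 2  # 557 bp of extra context on each side

ALPHABET = "ACGT"  # assumed channel order; not confirmed by this card


def windows_for_center(center):
    """Return (input_start, input_end, output_start, output_end) as half-open
    coordinates, assuming the output window is centered in the input window."""
    out_start = center - OUTPUT_LEN // 2
    out_end = out_start + OUTPUT_LEN
    return out_start - FLANK, out_end + FLANK, out_start, out_end


def one_hot_encode(seq):
    """Encode a DNA string as 4 rows (A, C, G, T) by len(seq) columns.
    Any other character, such as N, becomes an all-zero column (assumed)."""
    seq = seq.upper()
    return [[1.0 if base == letter else 0.0 for base in seq] for letter in ALPHABET]


in_start, in_end, out_start, out_end = windows_for_center(1_000_000)
print(in_end - in_start, out_end - out_start)  # 2114 1000
```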
Model Details
- Developed by: Adam Y. He and Anshul Kundaje
- Model type: BPNet
- Library: bpnet-lite
- Assay: PRO-cap
- Organism: Homo sapiens
- Genome assembly: GRCh38/hg38
- License: MIT
- Collection: https://huggingface.co/collections/adamyhe/procap-atlas
- Model repo: https://huggingface.co/adamyhe/procap-atlas-bpnet/
- Atlas metadata: https://huggingface.co/datasets/adamyhe/procap-atlas-metadata
- Companion track dataset: https://huggingface.co/datasets/adamyhe/procap-atlas-tracks
- Code repository: https://github.com/kundajelab/procap-atlas
- UCSC track hub: https://huggingface.co/datasets/adamyhe/procap-atlas-tracks/resolve/main/ucsc/hub.txt
Intended Use
These models are intended for research use in regulatory genomics, especially for analyzing sequence determinants of transcription start site activity, promoter-proximal initiation, and PRO-cap signal shape across ENCODE biosamples.
The models can also be used as components in downstream research workflows for variant effect exploration, motif analysis, or comparison of initiation programs across cell types and tissues. Downstream users should validate performance for their own loci, biosamples, and analysis settings.
Limitations
- The models are trained for human GRCh38/hg38 sequence and should not be assumed to transfer to other species.
- Predictions are PRO-cap-like initiation signals near trained peak loci, not general-purpose gene expression estimates. Background genomic sequence and fully synthetic sequences are likely to be out of distribution (OOD) and can produce unreliable predictions. Centromeres, LTRs, and other repetitive sequences are known to be problematic.
- ENCODE biosample coverage is uneven, and model behavior may vary across cell types, tissues, coverage levels, and assay quality. See dataset metadata and quality flags in the metadata repo linked above.
- Performance varies by experiment and read depth; inspect per-experiment benchmark results before using a model for downstream analysis.
- These models are research artifacts and should not be used for clinical or diagnostic decision-making.
Training Details
Training inputs come from ENCODE PRO-cap plus- and minus-strand signal tracks and processed peak sets generated by the companion procap-atlas preprocessing pipeline. Processed training data will be made available at adamyhe/procap-atlas-tracks.
Each experiment is trained across seven chromosome folds defined in configs/chrom_splits.yaml. For model fold i, data fold i is held out for testing, data fold (i + 1) % 7 is used for validation, and the remaining data folds are used for training.
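The fold rotation above can be written out as a small helper. This is a minimal sketch of the stated rule only; it assumes folds are indexed from 0 to 6, does not read configs/chrom_splits.yaml, and does not know which chromosomes belong to each fold. The function name is hypothetical.

```python
N_FOLDS = 7  # number of chromosome folds stated in this card


def fold_roles(model_fold, n_folds=N_FOLDS):
    """Return which data folds are used for test, validation, and training
    for a given model fold, following the rule described in this card."""
    if not 0 <= model_fold < n_folds:
        raise ValueError(f"model_fold must be in [0, {n_folds - 1}]")
    test = model_fold
    valid = (model_fold + 1) % n_folds
    train = [f for f in range(n_folds) if f not in (test, valid)]
    return {"test": test, "valid": valid, "train": train}


for i in range(N_FOLDS):
    print(i, fold_roles(i))
```

Note that model fold 6 wraps around and validates on data fold 0, so every data fold is used exactly once for testing and once for validation across the seven models.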
Default training uses GC-matched negatives at a ratio of 1/7 negatives per positive example. Training also uses reverse-complement augmentation, up to 200 bp sequence jitter, AdamW optimization, batch size 64, learning rate 5e-4, 512 filters, 8 BPNet layers, and count loss weight 100.
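To make the augmentation settings more concrete, here is a hedged sketch of how they could be applied to one training example. It is not the project's training code: the uniform, symmetric jitter, the rounding of the negative count, and the strand swap under reverse complementation are assumptions about how such augmentations are usually implemented for stranded signal, and all function names are hypothetical.

```python
import random

MAX_JITTER = 200        # maximum shift in bp, from this card
NEGATIVE_RATIO = 1 / 7  # negatives per positive example, from this card

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def n_negatives(n_positives, ratio=NEGATIVE_RATIO):
    """Number of GC-matched negatives to draw for n_positives (rounding assumed)."""
    return round(n_positives * ratio)


def jittered_center(center, rng, max_jitter=MAX_JITTER):
    """Shift a window center by a uniform integer offset in
    [-max_jitter, max_jitter] (symmetric jitter is an assumption)."""
    return center + rng.randint(-max_jitter, max_jitter)


def reverse_complement_example(seq, plus, minus):
    """Reverse-complement a sequence together with its stranded signal.
    Viewed from the opposite strand, the reversed minus track takes the
    place of the plus track and vice versa."""
    rc_seq = seq.translate(_COMPLEMENT)[::-1]
    return rc_seq, minus[::-1], plus[::-1]


rng = random.Random(0)
print(n_negatives(7000), jittered_center(1_000_000, rng))
print(reverse_complement_example("AACG", [1, 2, 3, 4], [5, 6, 7, 8]))
```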
Evaluation
Models are evaluated on held-out test chromosomes for each fold. Metrics include profile Pearson correlation, profile Jensen-Shannon distance, log-counts Pearson correlation, and counts Spearman correlation.
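For readers unfamiliar with these metrics, the sketch below shows one way to compute a Jensen-Shannon distance and a Pearson correlation between an observed and a predicted profile using only the standard library. It is an illustration, not the project's benchmark code: the base-2 logarithm (which bounds the distance to [0, 1]) and the normalization of each profile to sum to one are assumptions, as is any choice about whether strands are scored separately or jointly.

```python
import math


def _normalize(values):
    """Scale a non-negative profile so that it sums to one."""
    total = sum(values)
    if total <= 0:
        raise ValueError("profile must have positive total signal")
    return [v / total for v in values]


def jensen_shannon_distance(observed, predicted):
    """Square root of the Jensen-Shannon divergence, using log base 2."""
    p, q = _normalize(observed), _normalize(predicted)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 whenever x > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


print(jensen_shannon_distance([1, 0], [0, 1]))  # 1.0
```

Lower Jensen-Shannon distance means the predicted profile shape is closer to the observed one, whereas higher correlation values are better.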
Across the 224 experiment benchmark files currently present in the project, median genome-wide performance is:
| Metric | Median |
|---|---|
| Profile Pearson correlation | 0.490 |
| Profile Jensen-Shannon distance | 0.357 |
| Log-counts Pearson correlation | 0.687 |
| Counts Spearman correlation | 0.701 |
Exact per-experiment results are written by the project benchmark workflow to performance_metrics/bpnet/{experiment}.json. Model performance depends heavily on PRO-cap library size, quality, and construction details, so consult the precomputed performance metrics and the atlas metadata to determine whether a model is suitable for your use case.
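A per-experiment benchmark file could be read with a small helper like the one below. This is a sketch under assumptions: the path template comes from this card, but `root` is assumed to be the directory that contains performance_metrics/, and nothing is assumed about the keys inside the JSON. The function names are hypothetical.

```python
import json
from pathlib import Path


def metrics_path(experiment, root="."):
    """Build the path performance_metrics/bpnet/{experiment}.json under root."""
    return Path(root) / "performance_metrics" / "bpnet" / f"{experiment}.json"


def load_metrics(experiment, root="."):
    """Load one experiment's benchmark file and return the parsed JSON.
    The schema is not assumed, so the caller should inspect the result."""
    with metrics_path(experiment, root).open() as handle:
        return json.load(handle)
```

For example, `load_metrics("ENCSR882DWM")` would return the parsed contents of that experiment's file if it exists under the current directory.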
Interpretability
The project includes workflows for DeepLIFT/SHAP-style BPNet attributions, observed-nucleotide attribution BigWig conversion, and TF-MoDISco motif discovery. Attribution outputs are generated per experiment and prediction head and can be viewed alongside PRO-cap signal tracks in genome browsers.
See the attribution and MoDISco workflows in the src/bpnet directory for details.
How to Use
Download model artifacts from this Hugging Face model repository. Models are organized by ENCODE experiment accession, and each experiment contains seven fold checkpoints named by fold index.
Individual model checkpoints are saved as bpnetlite.bpnet.BPNet modules, so they can be loaded directly, provided that bpnet-lite is installed:
import torch
model = torch.load("/path/to/model/ENCSR882DWM.fold0.torch", weights_only=False)
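To work with all seven folds of one experiment, a helper along the following lines may be convenient. It is a sketch, not part of the repository: the file name pattern follows the single example above, while the directory layout (`<model_root>/<accession>/<accession>.fold<i>.torch`) is an assumption based on the "one model directory per accession" description. The function names are hypothetical.

```python
from pathlib import Path

N_FOLDS = 7  # folds per experiment, from this card


def fold_checkpoint_paths(model_root, accession, n_folds=N_FOLDS):
    """List expected checkpoint paths for one experiment.
    Assumed layout: <model_root>/<accession>/<accession>.fold<i>.torch"""
    base = Path(model_root) / accession
    return [base / f"{accession}.fold{i}.torch" for i in range(n_folds)]


def load_fold_models(model_root, accession, device="cpu"):
    """Load every fold checkpoint found locally and put it in eval mode."""
    import torch  # imported lazily so the path helper works without PyTorch

    models = []
    for path in fold_checkpoint_paths(model_root, accession):
        if path.exists():
            # weights_only=False unpickles the full module, so bpnet-lite
            # must be installed; only load checkpoints from a trusted source.
            model = torch.load(path, map_location=device, weights_only=False)
            models.append(model.eval())
    return models


for path in fold_checkpoint_paths("/path/to/model", "ENCSR882DWM"):
    print(path)
```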
The companion GitHub repository contains scripts for preprocessing, inference, benchmarking, attribution, and motif discovery:
git clone https://github.com/kundajelab/procap-atlas.git
cd procap-atlas
mamba env create -f environment.yml
mamba activate procap-atlas
python src/bpnet/benchmark/benchmark_bpnet.py -e ENCSR882DWM
python src/bpnet/attribute/attribute_bpnet.py -e ENCSR882DWM
See the repository README and src/bpnet/README.md for additional details.
Citation
If you use these models, please cite the PRO-cap atlas model repository and the underlying software dependencies used for training and interpretation, including bpnet-lite, tangermeme, and tfmodisco.
Contact
For questions, bug reports, or reuse notes, please use the GitHub repository issues: https://github.com/kundajelab/procap-atlas/issues