chorus-epinformerseq-v2
Per-cell EPInformer-seq checkpoints for 11 Roadmap Epigenomics cell types.
Drop-in artifacts for the Chorus epinformerseq oracle.
Architecture
PerCellProfileNetWide: a dilated-CNN, 2-channel profile model. A
2114-bp input is run through the body, then the central 1024 bp is
cropped for the heads (ChromBPNet-style "valid" geometry, so every output base
has a full real-sequence receptive field). The two channels are:
- ch0 (DNase): 5′ cut sites (motif-sensitive, sharp).
- ch1 (H3K27ac): read signal (the active-enhancer histone mark).
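The crop geometry above comes down to simple arithmetic: a 2114-bp input and a centered 1024-bp output leave 545 bp of flank on each side. A minimal sketch of that bookkeeping follows; the function names are illustrative and are not part of the Chorus API.

```python
INPUT_LEN = 2114   # bp fed to the model body
OUTPUT_LEN = 1024  # bp kept for the profile heads


def central_crop_bounds(input_len=INPUT_LEN, output_len=OUTPUT_LEN):
    """Return (start, end) of the centered crop, in input coordinates."""
    if output_len > input_len:
        raise ValueError("output window cannot exceed the input window")
    if (input_len - output_len) % 2:
        raise ValueError("flanks must be symmetric")
    flank = (input_len - output_len) // 2
    return flank, flank + output_len


def central_crop(values, output_len=OUTPUT_LEN):
    """Crop a per-base sequence or track to its central output_len positions."""
    start, end = central_crop_bounds(len(values), output_len)
    return values[start:end]


start, end = central_crop_bounds()
print(start, end)  # 545 1569
```

Each output base therefore has at least 545 bp of real sequence on either side, which is what lets the body run without padding at the edges of the cropped region.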
Each cell line gets its own main checkpoint (no FiLM, no cell embedding),
paired with a per-cell frozen 2-channel BiasNet (1024-bp, run on the
central crop): ch0 subtracts the Tn5/DNase enzymatic cut bias, ch1 the H3K27ac
background bias. H3K27ac has no enzymatic cut, so ch1 models background signal
only, and no separate H3K27ac bias model is needed.
The retired joint CellCondProfileNet (FiLM over 11 cells) and the earlier
1024-bp SAME-padded PerCellProfileNet are not used by Chorus.
Layout
per_cell_widewin/
K562/main.pt # PerCellProfileNetWide state_dict (~136K params)
... (11 cells)
bias/
K562/bias.pt # 2-channel BiasNet state_dict (~37K params, frozen)
... (11 cells)
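For scripting against the layout above, a small path helper can be sketched as follows. It assumes that `per_cell_widewin/` and `bias/` both sit at the repository root (the listing does not say how they nest) and that every cell follows the K562 naming pattern; `checkpoint_paths` is a hypothetical helper, not a Chorus function.

```python
from pathlib import PurePosixPath

# Cell names as listed under "Available cells".
CELLS = [
    "K562", "GM12878", "HepG2", "A549", "H1", "HeLa",
    "HMEC", "HSMM", "HUVEC", "NHEK", "NHLF",
]


def checkpoint_paths(cell, root="."):
    """Return the assumed main and bias checkpoint paths for one cell."""
    if cell not in CELLS:
        raise ValueError(f"unknown cell type: {cell!r}")
    root = PurePosixPath(root)
    return {
        # Assumption: both directories are siblings under `root`.
        "main": root / "per_cell_widewin" / cell / "main.pt",
        "bias": root / "bias" / cell / "bias.pt",
    }


for name, path in checkpoint_paths("K562").items():
    print(name, path)
```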
Shipped weights (2026-06-04): the roadmapretrain weights, trained on Roadmap DNase-summit peaks with the Roadmap-pipeline H3K27ac. They give the strongest per-cell DNase test-r and a functional H3K27ac channel, and supersede an earlier ENCODE-IDR-peak / full-coverage-H3K27ac variant.
Training
- Window: 2114-bp input → central 1024-bp profile crop, on Roadmap DNase peak summits.
- Signals: ch0 = per-bp 5′ DNase cut-site counts; ch1 = per-bp H3K27ac signal (Roadmap v2.1 pipeline).
- Data: ENCODE DNase + H3K27ac BAMs (SE/PE), fold-10 leave-chromosomes-out split (chr11+chr21 held out).
- Loss: multinomial NLL on the per-bp 2-channel profile + MSE on log10 count per channel. AdamW + OneCycleLR.
- Per-cell held-out test-r: DNase 0.59–0.88, H3K27ac 0.48–0.67.
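The loss named in the list above, multinomial NLL on the profile plus MSE on the log10 count, can be illustrated in plain Python. This is a sketch of the general technique, not the training code. The `+1` pseudocount inside the log, the `count_weight` factor, and summing the two channels with equal weight are all assumptions that the list does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def multinomial_nll(logits, counts):
    """NLL of observed per-bp counts under softmax(logits)."""
    logp = log_softmax(logits)
    n = sum(counts)
    # Log multinomial coefficient; constant with respect to the logits.
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return -(log_coef + sum(c * lp for c, lp in zip(counts, logp)))


def count_mse(pred_log10_count, counts, pseudocount=1.0):
    """Squared error on log10 total count (the pseudocount is an assumption)."""
    target = math.log10(sum(counts) + pseudocount)
    return (pred_log10_count - target) ** 2


def profile_loss(logits_by_ch, pred_log10_counts, counts_by_ch, count_weight=1.0):
    """Sum of profile NLL and weighted count MSE over channels."""
    total = 0.0
    for logits, pred, counts in zip(logits_by_ch, pred_log10_counts, counts_by_ch):
        total += multinomial_nll(logits, counts)
        total += count_weight * count_mse(pred, counts)
    return total
```

Splitting the objective this way lets the profile head learn the shape of the signal independently of its overall magnitude, which the count head handles.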
Usage
from chorus.oracles import EPInformerSeqOracle

oracle = EPInformerSeqOracle(cell_type="K562")
oracle.load_pretrained_model()  # downloads from this repo on first run
result = oracle.predict(
    sequence="A" * 2114,
    assay_ids=["Enhancer_DNase:K562"],  # or Enhancer_H3K27ac, Enhancer_H3K27ac_DNase
)
Available cells: K562, GM12878, HepG2, A549, H1, HeLa, HMEC, HSMM, HUVEC, NHEK, NHLF.
Available assays (max over the central 256 bp of the 1024-bp output):
Enhancer_DNase (default, max DNase), Enhancer_H3K27ac (max H3K27ac),
Enhancer_H3K27ac_DNase (composite sqrt(max DNase · max H3K27ac)).
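The three assay scores described above reduce to a windowed maximum and a geometric mean. The sketch below shows that reduction on plain Python lists; it assumes non-negative tracks and a window centered by integer division, and `enhancer_scores` is an illustrative name rather than the oracle's internal function.

```python
import math


def central_max(track, window=256):
    """Maximum of a per-bp track over its central `window` positions."""
    n = len(track)
    if window > n:
        raise ValueError("window is longer than the track")
    start = (n - window) // 2
    return max(track[start:start + window])


def enhancer_scores(dnase, h3k27ac, window=256):
    """Compute the three summary scores from two 1024-bp output tracks."""
    d = central_max(dnase, window)
    h = central_max(h3k27ac, window)
    return {
        "Enhancer_DNase": d,
        "Enhancer_H3K27ac": h,
        # Geometric mean of the two maxima; assumes both are non-negative.
        "Enhancer_H3K27ac_DNase": math.sqrt(d * h),
    }


dnase = [0.0] * 1024
h3k27ac = [0.0] * 1024
dnase[500] = 4.0     # inside the central window (positions 384..639)
dnase[10] = 100.0    # outside the window, so ignored
h3k27ac[600] = 9.0
print(enhancer_scores(dnase, h3k27ac))
```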
Background CDFs
Chorus pulls per-track CDFs from the companion dataset lucapinello/chorus-backgrounds (epinformerseq_pertrack.npz, 33 tracks = 3 assays × 11 cells).
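To show what a per-track background CDF is used for, the sketch below builds an empirical CDF from a list of background scores and maps a raw score to its quantile. It does not read epinformerseq_pertrack.npz, whose internal layout is not described here; the background values and the `empirical_cdf` helper are purely illustrative.

```python
from bisect import bisect_right


def empirical_cdf(background):
    """Return a function mapping a score to its quantile in `background`."""
    sorted_bg = sorted(background)
    n = len(sorted_bg)
    if n == 0:
        raise ValueError("background must contain at least one score")

    def cdf(score):
        # Fraction of background scores less than or equal to `score`.
        return bisect_right(sorted_bg, score) / n

    return cdf


# Hypothetical background scores for one track.
cdf = empirical_cdf([1.0, 2.0, 3.0, 4.0])
print(cdf(2.5))  # 0.5
```

Expressing a raw prediction as a quantile makes scores comparable across tracks whose raw ranges differ.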
Citation
Cite the EPInformer paper and the Chorus pipeline (see the Chorus README).