Upload folder using huggingface_hub
#1
by
avantikalal
- opened
- README.md +59 -3
- human.ckpt +3 -0
- human_state_dict.h5 +3 -0
- mouse.ckpt +3 -0
- mouse_state_dict.h5 +3 -0
- save_wandb_enformer_human.ipynb +1002 -0
- save_wandb_enformer_mouse.ipynb +979 -0
README.md
CHANGED
|
@@ -1,3 +1,59 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
library_name: pytorch-lightning
|
| 4 |
+
pipeline_tag: tabular-regression
|
| 5 |
+
tags:
|
| 6 |
+
- biology
|
| 7 |
+
- genomics
|
| 8 |
+
datasets:
|
| 9 |
+
- Genentech/enformer-data
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# Enformer Model (Avsec et al. 2021)
|
| 13 |
+
|
| 14 |
+
## Model Description
|
| 15 |
+
This repository contains the weights for the Enformer model, a long-range transformer architecture designed to predict functional genomic tracks from genomic DNA sequences.
|
| 16 |
+
|
| 17 |
+
- **Architecture:** Convolutions followed by Transformer layers.
|
| 18 |
+
- **Input:** 196,608 bp of genomic DNA sequence.
|
| 19 |
+
- **Output Resolution:** 128 bp bins.
|
| 20 |
+
- **Source:** [Avsec, Ž. et al. Nature Methods (2021)](https://www.nature.com/articles/s41592-021-01252-x)
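The input length and bin size listed above determine how many 128 bp bins tile the input window. The sketch below only performs that arithmetic. The helper name `bins_spanning` is hypothetical, and nothing here asserts how many of those bins the model actually reports, since the README does not state it.

```python
# Arithmetic on the two sizes quoted in the model description.
INPUT_LENGTH_BP = 196_608  # input window length, from the README
BIN_SIZE_BP = 128          # output resolution, from the README


def bins_spanning(length_bp: int, bin_size_bp: int) -> int:
    """Return how many whole bins tile a window of length_bp base pairs."""
    if length_bp % bin_size_bp != 0:
        raise ValueError("window length is not a multiple of the bin size")
    return length_bp // bin_size_bp


n_bins = bins_spanning(INPUT_LENGTH_BP, BIN_SIZE_BP)
print(n_bins)  # 1536
```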
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
## Repository Content
|
| 24 |
+
The repository includes both full PyTorch Lightning checkpoints and raw state dictionaries for the human and mouse versions of the model. The weights are derived from the publication, but the model has been converted to the PyTorch Lightning format used by gReLU (https://github.com/Genentech/gReLU).
|
| 25 |
+
|
| 26 |
+
| File | Type | Description |
|
| 27 |
+
| :--- | :--- | :--- |
|
| 28 |
+
| `human.ckpt` | PyTorch Lightning | Full checkpoint including base model and human head. |
|
| 29 |
+
| `mouse.ckpt` | PyTorch Lightning | Full checkpoint including base model and mouse head. |
|
| 30 |
+
| `human_state_dict.h5` | HDF5 | Weights-only state dictionary for the human model. |
|
| 31 |
+
| `mouse_state_dict.h5` | HDF5 | Weights-only state dictionary for the mouse model. |
|
| 32 |
+
| `save_wandb_enformer_human.ipynb` | Jupyter Notebook | Code used to create `human.ckpt`. |
|
| 33 |
+
| `save_wandb_enformer_mouse.ipynb` | Jupyter Notebook | Code used to create `mouse.ckpt`. |
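The weight files in the table follow a regular naming pattern by species and file type. As a minimal sketch, the lookup below maps a species and a kind of file to the corresponding name. The function `repo_filename` and the labels `"checkpoint"` and `"state_dict"` are hypothetical names introduced for illustration, not part of gReLU or `huggingface_hub`.

```python
# Hypothetical lookup from (species, kind) to the file names listed in the table.
REPO_FILES = {
    ("human", "checkpoint"): "human.ckpt",
    ("mouse", "checkpoint"): "mouse.ckpt",
    ("human", "state_dict"): "human_state_dict.h5",
    ("mouse", "state_dict"): "mouse_state_dict.h5",
}


def repo_filename(species: str, kind: str = "checkpoint") -> str:
    """Return the repository file name for a species and file kind."""
    key = (species.lower(), kind)
    if key not in REPO_FILES:
        raise KeyError(f"no file listed for species={species!r}, kind={kind!r}")
    return REPO_FILES[key]


print(repo_filename("mouse"))                # mouse.ckpt
print(repo_filename("human", "state_dict"))  # human_state_dict.h5
```

The returned name can be passed as the `filename` argument of `hf_hub_download`, in the same way the Usage section passes `"human.ckpt"`.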
|
| 34 |
+
|
| 35 |
+
## Model Heads & Output Tracks
|
| 36 |
+
Both `.ckpt` files use the same core transformer trunk but differ in their species-specific output heads.
|
| 37 |
+
|
| 38 |
+
### Outputs
|
| 39 |
+
|
| 40 |
+
- **Human head:** 5,313 total tracks
|
| 41 |
+
- **Mouse head:** 1,643 total tracks
|
| 42 |
+
|
| 43 |
+
## Usage
|
| 44 |
+
The models are intended for use with the `grelu` library.
|
| 45 |
+
|
| 46 |
+
```python
|
| 47 |
+
from grelu.lightning import LightningModel
|
| 48 |
+
from huggingface_hub import hf_hub_download
|
| 49 |
+
|
| 50 |
+
# Download the desired checkpoint
|
| 51 |
+
ckpt_path = hf_hub_download(
|
| 52 |
+
repo_id="Genentech/enformer-model",
|
| 53 |
+
filename="human.ckpt"
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
# Load the model
|
| 57 |
+
model = LightningModel.load_from_checkpoint(ckpt_path)
|
| 58 |
+
model.eval()
|
| 59 |
+
```
|
human.ckpt
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:030e706b7b64c744be2c363b73e822694cf404164e8720cafb5940cf81da4db6
|
| 3 |
+
size 986737142
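The three lines above are a Git LFS pointer: they record the hash and byte size of the real `human.ckpt` rather than its contents. As a hedged sketch, a downloaded copy could be checked against the pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local path to the downloaded file is left to the caller.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:030e706b7b64c744be2c363b73e822694cf404164e8720cafb5940cf81da4db6
size 986737142
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare its size and digest with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])  # sha256 986737142
```

Reading in chunks keeps memory use flat, which matters for a file of roughly 1 GB.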
|
human_state_dict.h5
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b9b9080c6ddfe1285175bb909b1440c427527784db9134757bbbd986c826de4f
|
| 3 |
+
size 984921590
|
mouse.ckpt
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:19246b404028c4dd943265c519d51d796889c712091547539ecd7d8ad0d5e29b
|
| 3 |
+
size 940724534
|
mouse_state_dict.h5
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f97535122efb146d998ca7f4bbcee86ddf1d4d1c8cec5e5d9f8e5fcd42cbe2ff
|
| 3 |
+
size 939809910
|
save_wandb_enformer_human.ipynb
ADDED
|
@@ -0,0 +1,1002 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": 1,
|
| 6 |
+
"id": "0b100814-a834-4c18-abbe-9cab6ea1278c",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [
|
| 9 |
+
{
|
| 10 |
+
"name": "stderr",
|
| 11 |
+
"output_type": "stream",
|
| 12 |
+
"text": [
|
| 13 |
+
"/opt/conda/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 14 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 15 |
+
]
|
| 16 |
+
}
|
| 17 |
+
],
|
| 18 |
+
"source": [
|
| 19 |
+
"import wandb\n",
|
| 20 |
+
"import torch\n",
|
| 21 |
+
"import pandas as pd\n",
|
| 22 |
+
"\n",
|
| 23 |
+
"from grelu.lightning import LightningModel\n",
|
| 24 |
+
"import pytorch_lightning as pl\n",
|
| 25 |
+
"from grelu.sequence.utils import get_unique_length, resize"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "cb22e3f0-8ef7-41f3-aefc-4a2d182af5ba",
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"source": [
|
| 33 |
+
"## wandb login"
|
| 34 |
+
]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"cell_type": "code",
|
| 38 |
+
"execution_count": 2,
|
| 39 |
+
"id": "137c5253-351e-4945-88a3-a4b7c555326c",
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"outputs": [
|
| 42 |
+
{
|
| 43 |
+
"name": "stderr",
|
| 44 |
+
"output_type": "stream",
|
| 45 |
+
"text": [
|
| 46 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.\n",
|
| 47 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mavantikalal\u001b[0m (\u001b[33mgrelu\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n"
|
| 48 |
+
]
|
| 49 |
+
},
|
| 50 |
+
{
|
| 51 |
+
"data": {
|
| 52 |
+
"text/plain": [
|
| 53 |
+
"True"
|
| 54 |
+
]
|
| 55 |
+
},
|
| 56 |
+
"execution_count": 2,
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"output_type": "execute_result"
|
| 59 |
+
}
|
| 60 |
+
],
|
| 61 |
+
"source": [
|
| 62 |
+
"wandb.login(host=\"https://api.wandb.ai\")"
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": 12,
|
| 68 |
+
"id": "d372dc19-d701-407f-9303-40adc76f475c",
|
| 69 |
+
"metadata": {},
|
| 70 |
+
"outputs": [
|
| 71 |
+
{
|
| 72 |
+
"data": {
|
| 73 |
+
"text/html": [
|
| 74 |
+
"Tracking run with wandb version 0.19.7"
|
| 75 |
+
],
|
| 76 |
+
"text/plain": [
|
| 77 |
+
"<IPython.core.display.HTML object>"
|
| 78 |
+
]
|
| 79 |
+
},
|
| 80 |
+
"metadata": {},
|
| 81 |
+
"output_type": "display_data"
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"data": {
|
| 85 |
+
"text/html": [
|
| 86 |
+
"Run data is saved locally in <code>/code/github/gReLU-applications/enformer/wandb/run-20250304_222603-rwin6f0o</code>"
|
| 87 |
+
],
|
| 88 |
+
"text/plain": [
|
| 89 |
+
"<IPython.core.display.HTML object>"
|
| 90 |
+
]
|
| 91 |
+
},
|
| 92 |
+
"metadata": {},
|
| 93 |
+
"output_type": "display_data"
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"data": {
|
| 97 |
+
"text/html": [
|
| 98 |
+
"Syncing run <strong><a href='https://wandb.ai/grelu/enformer/runs/rwin6f0o' target=\"_blank\">copy-human</a></strong> to <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">Weights & Biases</a> (<a href='https://wandb.me/developer-guide' target=\"_blank\">docs</a>)<br>"
|
| 99 |
+
],
|
| 100 |
+
"text/plain": [
|
| 101 |
+
"<IPython.core.display.HTML object>"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
"metadata": {},
|
| 105 |
+
"output_type": "display_data"
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"data": {
|
| 109 |
+
"text/html": [
|
| 110 |
+
" View project at <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">https://wandb.ai/grelu/enformer</a>"
|
| 111 |
+
],
|
| 112 |
+
"text/plain": [
|
| 113 |
+
"<IPython.core.display.HTML object>"
|
| 114 |
+
]
|
| 115 |
+
},
|
| 116 |
+
"metadata": {},
|
| 117 |
+
"output_type": "display_data"
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"data": {
|
| 121 |
+
"text/html": [
|
| 122 |
+
" View run at <a href='https://wandb.ai/grelu/enformer/runs/rwin6f0o' target=\"_blank\">https://wandb.ai/grelu/enformer/runs/rwin6f0o</a>"
|
| 123 |
+
],
|
| 124 |
+
"text/plain": [
|
| 125 |
+
"<IPython.core.display.HTML object>"
|
| 126 |
+
]
|
| 127 |
+
},
|
| 128 |
+
"metadata": {},
|
| 129 |
+
"output_type": "display_data"
|
| 130 |
+
}
|
| 131 |
+
],
|
| 132 |
+
"source": [
|
| 133 |
+
"run = wandb.init(entity='grelu', project='enformer', job_type='copy', name='copy-human',\n",
|
| 134 |
+
" settings=wandb.Settings(\n",
|
| 135 |
+
" program_relpath='/code/github/gReLU-applications/enformer/save_wandb_enformer_human.ipynb',\n",
|
| 136 |
+
" program_abspath='/code/github/gReLU-applications/enformer/save_wandb_enformer_human.ipynb'\n",
|
| 137 |
+
" ))"
|
| 138 |
+
]
|
| 139 |
+
},
|
| 140 |
+
{
|
| 141 |
+
"cell_type": "code",
|
| 142 |
+
"execution_count": 13,
|
| 143 |
+
"id": "d116934e-657f-49af-a3de-1e0b4d6e895d",
|
| 144 |
+
"metadata": {},
|
| 145 |
+
"outputs": [
|
| 146 |
+
{
|
| 147 |
+
"name": "stderr",
|
| 148 |
+
"output_type": "stream",
|
| 149 |
+
"text": [
|
| 150 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m No relevant files were detected in the specified directory. No code will be logged to your run.\n"
|
| 151 |
+
]
|
| 152 |
+
}
|
| 153 |
+
],
|
| 154 |
+
"source": [
|
| 155 |
+
"wandb.run.log_code() "
|
| 156 |
+
]
|
| 157 |
+
},
|
| 158 |
+
{
|
| 159 |
+
"cell_type": "markdown",
|
| 160 |
+
"id": "4564271b-2a71-47e3-bfc4-9b0f4c616a73",
|
| 161 |
+
"metadata": {},
|
| 162 |
+
"source": [
|
| 163 |
+
"## Paths"
|
| 164 |
+
]
|
| 165 |
+
},
|
| 166 |
+
{
|
| 167 |
+
"cell_type": "code",
|
| 168 |
+
"execution_count": 14,
|
| 169 |
+
"id": "86052145-d3a9-4f6a-9b23-8353b5dd38e1",
|
| 170 |
+
"metadata": {},
|
| 171 |
+
"outputs": [],
|
| 172 |
+
"source": [
|
| 173 |
+
"targets_path = 'https://raw.githubusercontent.com/calico/basenji/master/manuscripts/cross2020/targets_human.txt'"
|
| 174 |
+
]
|
| 175 |
+
},
|
| 176 |
+
{
|
| 177 |
+
"cell_type": "code",
|
| 178 |
+
"execution_count": 15,
|
| 179 |
+
"id": "97aed0b6-8b61-424a-8366-37c204b061d8",
|
| 180 |
+
"metadata": {},
|
| 181 |
+
"outputs": [],
|
| 182 |
+
"source": [
|
| 183 |
+
"sequences_path = '/gstore/data/resbioai/grelu/enformer/sequences.bed'"
|
| 184 |
+
]
|
| 185 |
+
},
|
| 186 |
+
{
|
| 187 |
+
"cell_type": "markdown",
|
| 188 |
+
"id": "915c74f7-15de-4ae6-a9c9-0142fa885f1e",
|
| 189 |
+
"metadata": {},
|
| 190 |
+
"source": [
|
| 191 |
+
"## Process tasks"
|
| 192 |
+
]
|
| 193 |
+
},
|
| 194 |
+
{
|
| 195 |
+
"cell_type": "code",
|
| 196 |
+
"execution_count": 16,
|
| 197 |
+
"id": "2d8d16a8-95a9-4902-bec9-a25b44a5876e",
|
| 198 |
+
"metadata": {},
|
| 199 |
+
"outputs": [
|
| 200 |
+
{
|
| 201 |
+
"name": "stdout",
|
| 202 |
+
"output_type": "stream",
|
| 203 |
+
"text": [
|
| 204 |
+
"5313\n"
|
| 205 |
+
]
|
| 206 |
+
},
|
| 207 |
+
{
|
| 208 |
+
"data": {
|
| 209 |
+
"text/html": [
|
| 210 |
+
"<div>\n",
|
| 211 |
+
"<style scoped>\n",
|
| 212 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 213 |
+
" vertical-align: middle;\n",
|
| 214 |
+
" }\n",
|
| 215 |
+
"\n",
|
| 216 |
+
" .dataframe tbody tr th {\n",
|
| 217 |
+
" vertical-align: top;\n",
|
| 218 |
+
" }\n",
|
| 219 |
+
"\n",
|
| 220 |
+
" .dataframe thead th {\n",
|
| 221 |
+
" text-align: right;\n",
|
| 222 |
+
" }\n",
|
| 223 |
+
"</style>\n",
|
| 224 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 225 |
+
" <thead>\n",
|
| 226 |
+
" <tr style=\"text-align: right;\">\n",
|
| 227 |
+
" <th></th>\n",
|
| 228 |
+
" <th>genome</th>\n",
|
| 229 |
+
" <th>identifier</th>\n",
|
| 230 |
+
" <th>file</th>\n",
|
| 231 |
+
" <th>clip</th>\n",
|
| 232 |
+
" <th>scale</th>\n",
|
| 233 |
+
" <th>sum_stat</th>\n",
|
| 234 |
+
" <th>description</th>\n",
|
| 235 |
+
" </tr>\n",
|
| 236 |
+
" <tr>\n",
|
| 237 |
+
" <th>index</th>\n",
|
| 238 |
+
" <th></th>\n",
|
| 239 |
+
" <th></th>\n",
|
| 240 |
+
" <th></th>\n",
|
| 241 |
+
" <th></th>\n",
|
| 242 |
+
" <th></th>\n",
|
| 243 |
+
" <th></th>\n",
|
| 244 |
+
" <th></th>\n",
|
| 245 |
+
" </tr>\n",
|
| 246 |
+
" </thead>\n",
|
| 247 |
+
" <tbody>\n",
|
| 248 |
+
" <tr>\n",
|
| 249 |
+
" <th>0</th>\n",
|
| 250 |
+
" <td>0</td>\n",
|
| 251 |
+
" <td>ENCFF833POA</td>\n",
|
| 252 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 253 |
+
" <td>32</td>\n",
|
| 254 |
+
" <td>2</td>\n",
|
| 255 |
+
" <td>mean</td>\n",
|
| 256 |
+
" <td>DNASE:cerebellum male adult (27 years) and mal...</td>\n",
|
| 257 |
+
" </tr>\n",
|
| 258 |
+
" <tr>\n",
|
| 259 |
+
" <th>1</th>\n",
|
| 260 |
+
" <td>0</td>\n",
|
| 261 |
+
" <td>ENCFF110QGM</td>\n",
|
| 262 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 263 |
+
" <td>32</td>\n",
|
| 264 |
+
" <td>2</td>\n",
|
| 265 |
+
" <td>mean</td>\n",
|
| 266 |
+
" <td>DNASE:frontal cortex male adult (27 years) and...</td>\n",
|
| 267 |
+
" </tr>\n",
|
| 268 |
+
" <tr>\n",
|
| 269 |
+
" <th>2</th>\n",
|
| 270 |
+
" <td>0</td>\n",
|
| 271 |
+
" <td>ENCFF880MKD</td>\n",
|
| 272 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 273 |
+
" <td>32</td>\n",
|
| 274 |
+
" <td>2</td>\n",
|
| 275 |
+
" <td>mean</td>\n",
|
| 276 |
+
" <td>DNASE:chorion</td>\n",
|
| 277 |
+
" </tr>\n",
|
| 278 |
+
" </tbody>\n",
|
| 279 |
+
"</table>\n",
|
| 280 |
+
"</div>"
|
| 281 |
+
],
|
| 282 |
+
"text/plain": [
|
| 283 |
+
" genome identifier file \\\n",
|
| 284 |
+
"index \n",
|
| 285 |
+
"0 0 ENCFF833POA /home/drk/tillage/datasets/human/dnase/encode/... \n",
|
| 286 |
+
"1 0 ENCFF110QGM /home/drk/tillage/datasets/human/dnase/encode/... \n",
|
| 287 |
+
"2 0 ENCFF880MKD /home/drk/tillage/datasets/human/dnase/encode/... \n",
|
| 288 |
+
"\n",
|
| 289 |
+
" clip scale sum_stat description \n",
|
| 290 |
+
"index \n",
|
| 291 |
+
"0 32 2 mean DNASE:cerebellum male adult (27 years) and mal... \n",
|
| 292 |
+
"1 32 2 mean DNASE:frontal cortex male adult (27 years) and... \n",
|
| 293 |
+
"2 32 2 mean DNASE:chorion "
|
| 294 |
+
]
|
| 295 |
+
},
|
| 296 |
+
"execution_count": 16,
|
| 297 |
+
"metadata": {},
|
| 298 |
+
"output_type": "execute_result"
|
| 299 |
+
}
|
| 300 |
+
],
|
| 301 |
+
"source": [
|
| 302 |
+
"tasks = pd.read_csv(targets_path, sep='\\t', index_col=0)\n",
|
| 303 |
+
"print(len(tasks))\n",
|
| 304 |
+
"tasks.head(3)"
|
| 305 |
+
]
|
| 306 |
+
},
|
| 307 |
+
{
|
| 308 |
+
"cell_type": "code",
|
| 309 |
+
"execution_count": 17,
|
| 310 |
+
"id": "68fd7e09-88be-4185-8983-090a4721dfbf",
|
| 311 |
+
"metadata": {},
|
| 312 |
+
"outputs": [
|
| 313 |
+
{
|
| 314 |
+
"data": {
|
| 315 |
+
"text/html": [
|
| 316 |
+
"<div>\n",
|
| 317 |
+
"<style scoped>\n",
|
| 318 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 319 |
+
" vertical-align: middle;\n",
|
| 320 |
+
" }\n",
|
| 321 |
+
"\n",
|
| 322 |
+
" .dataframe tbody tr th {\n",
|
| 323 |
+
" vertical-align: top;\n",
|
| 324 |
+
" }\n",
|
| 325 |
+
"\n",
|
| 326 |
+
" .dataframe thead th {\n",
|
| 327 |
+
" text-align: right;\n",
|
| 328 |
+
" }\n",
|
| 329 |
+
"</style>\n",
|
| 330 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 331 |
+
" <thead>\n",
|
| 332 |
+
" <tr style=\"text-align: right;\">\n",
|
| 333 |
+
" <th></th>\n",
|
| 334 |
+
" <th>name</th>\n",
|
| 335 |
+
" <th>file</th>\n",
|
| 336 |
+
" <th>clip</th>\n",
|
| 337 |
+
" <th>scale</th>\n",
|
| 338 |
+
" <th>sum_stat</th>\n",
|
| 339 |
+
" <th>description</th>\n",
|
| 340 |
+
" <th>assay</th>\n",
|
| 341 |
+
" <th>sample</th>\n",
|
| 342 |
+
" </tr>\n",
|
| 343 |
+
" </thead>\n",
|
| 344 |
+
" <tbody>\n",
|
| 345 |
+
" <tr>\n",
|
| 346 |
+
" <th>0</th>\n",
|
| 347 |
+
" <td>ENCFF833POA</td>\n",
|
| 348 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 349 |
+
" <td>32</td>\n",
|
| 350 |
+
" <td>2</td>\n",
|
| 351 |
+
" <td>mean</td>\n",
|
| 352 |
+
" <td>DNASE:cerebellum male adult (27 years) and mal...</td>\n",
|
| 353 |
+
" <td>DNASE</td>\n",
|
| 354 |
+
" <td>cerebellum male adult (27 years) and male adul...</td>\n",
|
| 355 |
+
" </tr>\n",
|
| 356 |
+
" <tr>\n",
|
| 357 |
+
" <th>1</th>\n",
|
| 358 |
+
" <td>ENCFF110QGM</td>\n",
|
| 359 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 360 |
+
" <td>32</td>\n",
|
| 361 |
+
" <td>2</td>\n",
|
| 362 |
+
" <td>mean</td>\n",
|
| 363 |
+
" <td>DNASE:frontal cortex male adult (27 years) and...</td>\n",
|
| 364 |
+
" <td>DNASE</td>\n",
|
| 365 |
+
" <td>frontal cortex male adult (27 years) and male ...</td>\n",
|
| 366 |
+
" </tr>\n",
|
| 367 |
+
" <tr>\n",
|
| 368 |
+
" <th>2</th>\n",
|
| 369 |
+
" <td>ENCFF880MKD</td>\n",
|
| 370 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 371 |
+
" <td>32</td>\n",
|
| 372 |
+
" <td>2</td>\n",
|
| 373 |
+
" <td>mean</td>\n",
|
| 374 |
+
" <td>DNASE:chorion</td>\n",
|
| 375 |
+
" <td>DNASE</td>\n",
|
| 376 |
+
" <td>chorion</td>\n",
|
| 377 |
+
" </tr>\n",
|
| 378 |
+
" <tr>\n",
|
| 379 |
+
" <th>3</th>\n",
|
| 380 |
+
" <td>ENCFF463ZLQ</td>\n",
|
| 381 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 382 |
+
" <td>32</td>\n",
|
| 383 |
+
" <td>2</td>\n",
|
| 384 |
+
" <td>mean</td>\n",
|
| 385 |
+
" <td>DNASE:Ishikawa treated with 0.02% dimethyl sul...</td>\n",
|
| 386 |
+
" <td>DNASE</td>\n",
|
| 387 |
+
" <td>Ishikawa treated with 0.02% dimethyl sulfoxide...</td>\n",
|
| 388 |
+
" </tr>\n",
|
| 389 |
+
" <tr>\n",
|
| 390 |
+
" <th>4</th>\n",
|
| 391 |
+
" <td>ENCFF890OGQ</td>\n",
|
| 392 |
+
" <td>/home/drk/tillage/datasets/human/dnase/encode/...</td>\n",
|
| 393 |
+
" <td>32</td>\n",
|
| 394 |
+
" <td>2</td>\n",
|
| 395 |
+
" <td>mean</td>\n",
|
| 396 |
+
" <td>DNASE:GM03348</td>\n",
|
| 397 |
+
" <td>DNASE</td>\n",
|
| 398 |
+
" <td>GM03348</td>\n",
|
| 399 |
+
" </tr>\n",
|
| 400 |
+
" </tbody>\n",
|
| 401 |
+
"</table>\n",
|
| 402 |
+
"</div>"
|
| 403 |
+
],
|
| 404 |
+
"text/plain": [
|
| 405 |
+
" name file clip \\\n",
|
| 406 |
+
"0 ENCFF833POA /home/drk/tillage/datasets/human/dnase/encode/... 32 \n",
|
| 407 |
+
"1 ENCFF110QGM /home/drk/tillage/datasets/human/dnase/encode/... 32 \n",
|
| 408 |
+
"2 ENCFF880MKD /home/drk/tillage/datasets/human/dnase/encode/... 32 \n",
|
| 409 |
+
"3 ENCFF463ZLQ /home/drk/tillage/datasets/human/dnase/encode/... 32 \n",
|
| 410 |
+
"4 ENCFF890OGQ /home/drk/tillage/datasets/human/dnase/encode/... 32 \n",
|
| 411 |
+
"\n",
|
| 412 |
+
" scale sum_stat description assay \\\n",
|
| 413 |
+
"0 2 mean DNASE:cerebellum male adult (27 years) and mal... DNASE \n",
|
| 414 |
+
"1 2 mean DNASE:frontal cortex male adult (27 years) and... DNASE \n",
|
| 415 |
+
"2 2 mean DNASE:chorion DNASE \n",
|
| 416 |
+
"3 2 mean DNASE:Ishikawa treated with 0.02% dimethyl sul... DNASE \n",
|
| 417 |
+
"4 2 mean DNASE:GM03348 DNASE \n",
|
| 418 |
+
"\n",
|
| 419 |
+
" sample \n",
|
| 420 |
+
"0 cerebellum male adult (27 years) and male adul... \n",
|
| 421 |
+
"1 frontal cortex male adult (27 years) and male ... \n",
|
| 422 |
+
"2 chorion \n",
|
| 423 |
+
"3 Ishikawa treated with 0.02% dimethyl sulfoxide... \n",
|
| 424 |
+
"4 GM03348 "
|
| 425 |
+
]
|
| 426 |
+
},
|
| 427 |
+
"execution_count": 17,
|
| 428 |
+
"metadata": {},
|
| 429 |
+
"output_type": "execute_result"
|
| 430 |
+
}
|
| 431 |
+
],
|
| 432 |
+
"source": [
|
| 433 |
+
"tasks = tasks.reset_index(drop=True)\n",
|
| 434 |
+
"tasks = tasks.drop(columns=[\"genome\"])\n",
|
| 435 |
+
"tasks[\"assay\"] = tasks[\"description\"].apply(lambda x: x.split(\":\")[0])\n",
|
| 436 |
+
"tasks[\"sample\"] = tasks[\"description\"].apply(lambda x: \":\".join(x.split(\":\")[1:]))\n",
|
| 437 |
+
"tasks = tasks.rename(columns={\"identifier\":\"name\"})\n",
|
| 438 |
+
"tasks.head()"
|
| 439 |
+
]
|
| 440 |
+
},
|
| 441 |
+
{
|
| 442 |
+
"cell_type": "code",
|
| 443 |
+
"execution_count": 18,
|
| 444 |
+
"id": "cc958ef5-f033-43df-b091-22e03021dd48",
|
| 445 |
+
"metadata": {},
|
| 446 |
+
"outputs": [],
|
| 447 |
+
"source": [
|
| 448 |
+
"tasks = tasks.to_dict(orient=\"list\")"
|
| 449 |
+
]
|
| 450 |
+
},
|
| 451 |
+
{
|
| 452 |
+
"cell_type": "markdown",
|
| 453 |
+
"id": "0caaa50e-a1dd-4d2a-ab17-06e545cef485",
|
| 454 |
+
"metadata": {},
|
| 455 |
+
"source": [
|
| 456 |
+
"## Process intervals"
|
| 457 |
+
]
|
| 458 |
+
},
|
| 459 |
+
{
|
| 460 |
+
"cell_type": "code",
|
| 461 |
+
"execution_count": 19,
|
| 462 |
+
"id": "456bf351-55aa-426d-a10d-0edc9b3591f4",
|
| 463 |
+
"metadata": {},
|
| 464 |
+
"outputs": [
|
| 465 |
+
{
|
| 466 |
+
"data": {
|
| 467 |
+
"text/html": [
|
| 468 |
+
"<div>\n",
|
| 469 |
+
"<style scoped>\n",
|
| 470 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 471 |
+
" vertical-align: middle;\n",
|
| 472 |
+
" }\n",
|
| 473 |
+
"\n",
|
| 474 |
+
" .dataframe tbody tr th {\n",
|
| 475 |
+
" vertical-align: top;\n",
|
| 476 |
+
" }\n",
|
| 477 |
+
"\n",
|
| 478 |
+
" .dataframe thead th {\n",
|
| 479 |
+
" text-align: right;\n",
|
| 480 |
+
" }\n",
|
| 481 |
+
"</style>\n",
|
| 482 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 483 |
+
" <thead>\n",
|
| 484 |
+
" <tr style=\"text-align: right;\">\n",
|
| 485 |
+
" <th></th>\n",
|
| 486 |
+
" <th>chrom</th>\n",
|
| 487 |
+
" <th>start</th>\n",
|
| 488 |
+
" <th>end</th>\n",
|
| 489 |
+
" <th>split</th>\n",
|
| 490 |
+
" </tr>\n",
|
| 491 |
+
" </thead>\n",
|
| 492 |
+
" <tbody>\n",
|
| 493 |
+
" <tr>\n",
|
| 494 |
+
" <th>0</th>\n",
|
| 495 |
+
" <td>chr18</td>\n",
|
| 496 |
+
" <td>928386</td>\n",
|
| 497 |
+
" <td>1059458</td>\n",
|
| 498 |
+
" <td>train</td>\n",
|
| 499 |
+
" </tr>\n",
|
| 500 |
+
" <tr>\n",
|
| 501 |
+
" <th>1</th>\n",
|
| 502 |
+
" <td>chr4</td>\n",
|
| 503 |
+
" <td>113630947</td>\n",
|
| 504 |
+
" <td>113762019</td>\n",
|
| 505 |
+
" <td>train</td>\n",
|
| 506 |
+
" </tr>\n",
|
| 507 |
+
" <tr>\n",
|
| 508 |
+
" <th>2</th>\n",
|
| 509 |
+
" <td>chr11</td>\n",
|
| 510 |
+
" <td>18427720</td>\n",
|
| 511 |
+
" <td>18558792</td>\n",
|
| 512 |
+
" <td>train</td>\n",
|
| 513 |
+
" </tr>\n",
|
| 514 |
+
" <tr>\n",
|
| 515 |
+
" <th>3</th>\n",
|
| 516 |
+
" <td>chr16</td>\n",
|
| 517 |
+
" <td>85805681</td>\n",
|
| 518 |
+
" <td>85936753</td>\n",
|
| 519 |
+
" <td>train</td>\n",
|
| 520 |
+
" </tr>\n",
|
| 521 |
+
" <tr>\n",
|
| 522 |
+
" <th>4</th>\n",
|
| 523 |
+
" <td>chr3</td>\n",
|
| 524 |
+
" <td>158386188</td>\n",
|
| 525 |
+
" <td>158517260</td>\n",
|
| 526 |
+
" <td>train</td>\n",
|
| 527 |
+
" </tr>\n",
|
| 528 |
+
" </tbody>\n",
|
| 529 |
+
"</table>\n",
|
| 530 |
+
"</div>"
|
| 531 |
+
],
|
| 532 |
+
"text/plain": [
|
| 533 |
+
" chrom start end split\n",
|
| 534 |
+
"0 chr18 928386 1059458 train\n",
|
| 535 |
+
"1 chr4 113630947 113762019 train\n",
|
| 536 |
+
"2 chr11 18427720 18558792 train\n",
|
| 537 |
+
"3 chr16 85805681 85936753 train\n",
|
| 538 |
+
"4 chr3 158386188 158517260 train"
|
| 539 |
+
]
|
| 540 |
+
},
|
| 541 |
+
"execution_count": 19,
|
| 542 |
+
"metadata": {},
|
| 543 |
+
"output_type": "execute_result"
|
| 544 |
+
}
|
| 545 |
+
],
|
| 546 |
+
"source": [
|
| 547 |
+
"intervals = pd.read_table(sequences_path, header=None)\n",
|
| 548 |
+
"intervals.columns = ['chrom', 'start', 'end', 'split']\n",
|
| 549 |
+
"intervals.head()"
|
| 550 |
+
]
|
| 551 |
+
},
|
| 552 |
+
{
|
| 553 |
+
"cell_type": "code",
|
| 554 |
+
"execution_count": 20,
|
| 555 |
+
"id": "1701f4c1-d7fe-42b6-80eb-323cb9755531",
|
| 556 |
+
"metadata": {},
|
| 557 |
+
"outputs": [
|
| 558 |
+
{
|
| 559 |
+
"data": {
|
| 560 |
+
"text/plain": [
|
| 561 |
+
"split\n",
|
| 562 |
+
"train 34021\n",
|
| 563 |
+
"valid 2213\n",
|
| 564 |
+
"test 1937\n",
|
| 565 |
+
"Name: count, dtype: int64"
|
| 566 |
+
]
|
| 567 |
+
},
|
| 568 |
+
"execution_count": 20,
|
| 569 |
+
"metadata": {},
|
| 570 |
+
"output_type": "execute_result"
|
| 571 |
+
}
|
| 572 |
+
],
|
| 573 |
+
"source": [
|
| 574 |
+
"intervals.split.value_counts()"
|
| 575 |
+
]
|
| 576 |
+
},
|
| 577 |
+
{
|
| 578 |
+
"cell_type": "code",
|
| 579 |
+
"execution_count": 21,
|
| 580 |
+
"id": "94298aa9-04cf-44e5-82c5-e1786207524d",
|
| 581 |
+
"metadata": {},
|
| 582 |
+
"outputs": [
|
| 583 |
+
{
|
| 584 |
+
"data": {
|
| 585 |
+
"text/plain": [
|
| 586 |
+
"131072"
|
| 587 |
+
]
|
| 588 |
+
},
|
| 589 |
+
"execution_count": 21,
|
| 590 |
+
"metadata": {},
|
| 591 |
+
"output_type": "execute_result"
|
| 592 |
+
}
|
| 593 |
+
],
|
| 594 |
+
"source": [
|
| 595 |
+
"get_unique_length(intervals)"
|
| 596 |
+
]
|
| 597 |
+
},
|
| 598 |
+
{
|
| 599 |
+
"cell_type": "code",
|
| 600 |
+
"execution_count": 22,
|
| 601 |
+
"id": "f4f97eb7-f087-46aa-a6c9-09ebbf5af775",
|
| 602 |
+
"metadata": {},
|
| 603 |
+
"outputs": [
|
| 604 |
+
{
|
| 605 |
+
"data": {
|
| 606 |
+
"text/html": [
|
| 607 |
+
"<div>\n",
|
| 608 |
+
"<style scoped>\n",
|
| 609 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 610 |
+
" vertical-align: middle;\n",
|
| 611 |
+
" }\n",
|
| 612 |
+
"\n",
|
| 613 |
+
" .dataframe tbody tr th {\n",
|
| 614 |
+
" vertical-align: top;\n",
|
| 615 |
+
" }\n",
|
| 616 |
+
"\n",
|
| 617 |
+
" .dataframe thead th {\n",
|
| 618 |
+
" text-align: right;\n",
|
| 619 |
+
" }\n",
|
| 620 |
+
"</style>\n",
|
| 621 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 622 |
+
" <thead>\n",
|
| 623 |
+
" <tr style=\"text-align: right;\">\n",
|
| 624 |
+
" <th></th>\n",
|
| 625 |
+
" <th>chrom</th>\n",
|
| 626 |
+
" <th>start</th>\n",
|
| 627 |
+
" <th>end</th>\n",
|
| 628 |
+
" <th>split</th>\n",
|
| 629 |
+
" </tr>\n",
|
| 630 |
+
" </thead>\n",
|
| 631 |
+
" <tbody>\n",
|
| 632 |
+
" <tr>\n",
|
| 633 |
+
" <th>0</th>\n",
|
| 634 |
+
" <td>chr18</td>\n",
|
| 635 |
+
" <td>895618</td>\n",
|
| 636 |
+
" <td>1092226</td>\n",
|
| 637 |
+
" <td>train</td>\n",
|
| 638 |
+
" </tr>\n",
|
| 639 |
+
" <tr>\n",
|
| 640 |
+
" <th>1</th>\n",
|
| 641 |
+
" <td>chr4</td>\n",
|
| 642 |
+
" <td>113598179</td>\n",
|
| 643 |
+
" <td>113794787</td>\n",
|
| 644 |
+
" <td>train</td>\n",
|
| 645 |
+
" </tr>\n",
|
| 646 |
+
" <tr>\n",
|
| 647 |
+
" <th>2</th>\n",
|
| 648 |
+
" <td>chr11</td>\n",
|
| 649 |
+
" <td>18394952</td>\n",
|
| 650 |
+
" <td>18591560</td>\n",
|
| 651 |
+
" <td>train</td>\n",
|
| 652 |
+
" </tr>\n",
|
| 653 |
+
" <tr>\n",
|
| 654 |
+
" <th>3</th>\n",
|
| 655 |
+
" <td>chr16</td>\n",
|
| 656 |
+
" <td>85772913</td>\n",
|
| 657 |
+
" <td>85969521</td>\n",
|
| 658 |
+
" <td>train</td>\n",
|
| 659 |
+
" </tr>\n",
|
| 660 |
+
" <tr>\n",
|
| 661 |
+
" <th>4</th>\n",
|
| 662 |
+
" <td>chr3</td>\n",
|
| 663 |
+
" <td>158353420</td>\n",
|
| 664 |
+
" <td>158550028</td>\n",
|
| 665 |
+
" <td>train</td>\n",
|
| 666 |
+
" </tr>\n",
|
| 667 |
+
" </tbody>\n",
|
| 668 |
+
"</table>\n",
|
| 669 |
+
"</div>"
|
| 670 |
+
],
|
| 671 |
+
"text/plain": [
|
| 672 |
+
" chrom start end split\n",
|
| 673 |
+
"0 chr18 895618 1092226 train\n",
|
| 674 |
+
"1 chr4 113598179 113794787 train\n",
|
| 675 |
+
"2 chr11 18394952 18591560 train\n",
|
| 676 |
+
"3 chr16 85772913 85969521 train\n",
|
| 677 |
+
"4 chr3 158353420 158550028 train"
|
| 678 |
+
]
|
| 679 |
+
},
|
| 680 |
+
"execution_count": 22,
|
| 681 |
+
"metadata": {},
|
| 682 |
+
"output_type": "execute_result"
|
| 683 |
+
}
|
| 684 |
+
],
|
| 685 |
+
"source": [
|
| 686 |
+
"intervals = resize(intervals, 196608)\n",
|
| 687 |
+
"intervals.head()"
|
| 688 |
+
]
|
| 689 |
+
},
|
| 690 |
+
{
|
| 691 |
+
"cell_type": "code",
|
| 692 |
+
"execution_count": 23,
|
| 693 |
+
"id": "9337f97b-8463-4f6a-bc51-23e83af46c89",
|
| 694 |
+
"metadata": {},
|
| 695 |
+
"outputs": [],
|
| 696 |
+
"source": [
|
| 697 |
+
"train_intervals = intervals[intervals.split=='train'].iloc[:, :3]\n",
|
| 698 |
+
"val_intervals = intervals[intervals.split=='valid'].iloc[:, :3]\n",
|
| 699 |
+
"test_intervals = intervals[intervals.split=='test'].iloc[:, :3]\n",
|
| 700 |
+
"del intervals"
|
| 701 |
+
]
|
| 702 |
+
},
|
| 703 |
+
{
|
| 704 |
+
"cell_type": "markdown",
|
| 705 |
+
"id": "62ecac6c-f031-4269-9672-50f57b883368",
|
| 706 |
+
"metadata": {},
|
| 707 |
+
"source": [
|
| 708 |
+
"## Initialize model"
|
| 709 |
+
]
|
| 710 |
+
},
|
| 711 |
+
{
|
| 712 |
+
"cell_type": "code",
|
| 713 |
+
"execution_count": 24,
|
| 714 |
+
"id": "a5f3af07-c670-4ceb-abc1-891f45b298d0",
|
| 715 |
+
"metadata": {},
|
| 716 |
+
"outputs": [],
|
| 717 |
+
"source": [
|
| 718 |
+
"model_params={\n",
|
| 719 |
+
" 'model_type':'EnformerModel',\n",
|
| 720 |
+
" 'final_act_func': 'softplus',\n",
|
| 721 |
+
" 'final_pool_func':None,\n",
|
| 722 |
+
" 'n_tasks': 5313,\n",
|
| 723 |
+
" 'crop_len':320,\n",
|
| 724 |
+
"}\n",
|
| 725 |
+
"train_params={'task':'regression', 'loss':'mse'}\n",
|
| 726 |
+
"\n",
|
| 727 |
+
"model = LightningModel(model_params, train_params)"
|
| 728 |
+
]
|
| 729 |
+
},
|
| 730 |
+
{
|
| 731 |
+
"cell_type": "markdown",
|
| 732 |
+
"id": "4e6981d5-fb40-486a-bb34-db296ac14778",
|
| 733 |
+
"metadata": {},
|
| 734 |
+
"source": [
|
| 735 |
+
"## Load weights"
|
| 736 |
+
]
|
| 737 |
+
},
|
| 738 |
+
{
|
| 739 |
+
"cell_type": "code",
|
| 740 |
+
"execution_count": 25,
|
| 741 |
+
"id": "08cef7a5-3a9c-4342-8d82-2eabf8eac4d6",
|
| 742 |
+
"metadata": {},
|
| 743 |
+
"outputs": [
|
| 744 |
+
{
|
| 745 |
+
"name": "stderr",
|
| 746 |
+
"output_type": "stream",
|
| 747 |
+
"text": [
|
| 748 |
+
"/tmp/ipykernel_3296127/1230005423.py:1: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
|
| 749 |
+
" state_dict = torch.load(\"/data/enformer/torch_weights/human.h5\")\n"
|
| 750 |
+
]
|
| 751 |
+
},
|
| 752 |
+
{
|
| 753 |
+
"data": {
|
| 754 |
+
"text/plain": [
|
| 755 |
+
"<All keys matched successfully>"
|
| 756 |
+
]
|
| 757 |
+
},
|
| 758 |
+
"execution_count": 25,
|
| 759 |
+
"metadata": {},
|
| 760 |
+
"output_type": "execute_result"
|
| 761 |
+
}
|
| 762 |
+
],
|
| 763 |
+
"source": [
|
| 764 |
+
"state_dict = torch.load(\"/data/enformer/torch_weights/human.h5\")\n",
|
| 765 |
+
"model.model.load_state_dict(state_dict)"
|
| 766 |
+
]
|
| 767 |
+
},
|
| 768 |
+
{
|
| 769 |
+
"cell_type": "markdown",
|
| 770 |
+
"id": "9b11fb1f-2d53-4b08-ae6e-15c3dfa1c1b1",
|
| 771 |
+
"metadata": {},
|
| 772 |
+
"source": [
|
| 773 |
+
"## Add hparams"
|
| 774 |
+
]
|
| 775 |
+
},
|
| 776 |
+
{
|
| 777 |
+
"cell_type": "code",
|
| 778 |
+
"execution_count": 26,
|
| 779 |
+
"id": "9fd166f4-998b-4ff1-aabd-fe3b12fd3b82",
|
| 780 |
+
"metadata": {},
|
| 781 |
+
"outputs": [],
|
| 782 |
+
"source": [
|
| 783 |
+
"model.data_params[\"train\"] = dict()\n",
|
| 784 |
+
"model.data_params[\"val\"] = dict()\n",
|
| 785 |
+
"model.data_params[\"test\"] = dict()"
|
| 786 |
+
]
|
| 787 |
+
},
|
| 788 |
+
{
|
| 789 |
+
"cell_type": "code",
|
| 790 |
+
"execution_count": 27,
|
| 791 |
+
"id": "09cbfeb5-9f73-44f8-81c5-7f5db0ea577e",
|
| 792 |
+
"metadata": {},
|
| 793 |
+
"outputs": [],
|
| 794 |
+
"source": [
|
| 795 |
+
"model.data_params[\"train\"][\"seq_len\"] = 196608\n",
|
| 796 |
+
"model.data_params[\"train\"][\"label_len\"] = 896 * 128\n",
|
| 797 |
+
"model.data_params[\"train\"][\"genome\"] = \"hg38\"\n",
|
| 798 |
+
"model.data_params[\"train\"][\"bin_size\"] = 128\n",
|
| 799 |
+
"model.data_params[\"train\"][\"max_seq_shift\"] = 3\n",
|
| 800 |
+
"model.data_params[\"train\"][\"rc\"] = True"
|
| 801 |
+
]
|
| 802 |
+
},
|
| 803 |
+
{
|
| 804 |
+
"cell_type": "markdown",
|
| 805 |
+
"id": "208f6727-0540-43dd-8554-4036b8e57180",
|
| 806 |
+
"metadata": {},
|
| 807 |
+
"source": [
|
| 808 |
+
"## Add tasks"
|
| 809 |
+
]
|
| 810 |
+
},
|
| 811 |
+
{
|
| 812 |
+
"cell_type": "code",
|
| 813 |
+
"execution_count": 28,
|
| 814 |
+
"id": "ba516087-edd9-46b8-9e76-44ff35d6e553",
|
| 815 |
+
"metadata": {},
|
| 816 |
+
"outputs": [],
|
| 817 |
+
"source": [
|
| 818 |
+
"model.data_params[\"tasks\"] = tasks"
|
| 819 |
+
]
|
| 820 |
+
},
|
| 821 |
+
{
|
| 822 |
+
"cell_type": "markdown",
|
| 823 |
+
"id": "ab5ef3dd-aaa4-4322-a25f-d6b818efb893",
|
| 824 |
+
"metadata": {},
|
| 825 |
+
"source": [
|
| 826 |
+
"## Add intervals"
|
| 827 |
+
]
|
| 828 |
+
},
|
| 829 |
+
{
|
| 830 |
+
"cell_type": "code",
|
| 831 |
+
"execution_count": 29,
|
| 832 |
+
"id": "be3564f3-2b4d-4618-9277-faceab43b7cd",
|
| 833 |
+
"metadata": {},
|
| 834 |
+
"outputs": [],
|
| 835 |
+
"source": [
|
| 836 |
+
"model.data_params[\"train\"][\"intervals\"] = train_intervals.to_dict(orient='list')\n",
|
| 837 |
+
"model.data_params[\"val\"][\"intervals\"] = val_intervals.to_dict(orient='list')\n",
|
| 838 |
+
"model.data_params[\"test\"][\"intervals\"] = test_intervals.to_dict(orient='list')"
|
| 839 |
+
]
|
| 840 |
+
},
|
| 841 |
+
{
|
| 842 |
+
"cell_type": "markdown",
|
| 843 |
+
"id": "d8471c54-8716-41de-84b0-212ba7419adb",
|
| 844 |
+
"metadata": {},
|
| 845 |
+
"source": [
|
| 846 |
+
"## Save"
|
| 847 |
+
]
|
| 848 |
+
},
|
| 849 |
+
{
|
| 850 |
+
"cell_type": "code",
|
| 851 |
+
"execution_count": 30,
|
| 852 |
+
"id": "7c4e3973-40e4-46c7-9ac9-8985d82498e1",
|
| 853 |
+
"metadata": {},
|
| 854 |
+
"outputs": [
|
| 855 |
+
{
|
| 856 |
+
"name": "stderr",
|
| 857 |
+
"output_type": "stream",
|
| 858 |
+
"text": [
|
| 859 |
+
"Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.\n",
|
| 860 |
+
"GPU available: True (cuda), used: True\n",
|
| 861 |
+
"TPU available: False, using: 0 TPU cores\n",
|
| 862 |
+
"HPU available: False, using: 0 HPUs\n",
|
| 863 |
+
"You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
|
| 864 |
+
"LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n"
|
| 865 |
+
]
|
| 866 |
+
}
|
| 867 |
+
],
|
| 868 |
+
"source": [
|
| 869 |
+
"trainer = pl.Trainer()\n",
|
| 870 |
+
"try:\n",
|
| 871 |
+
" trainer.predict(model) \n",
|
| 872 |
+
"except Exception:\n",
|
| 873 |
+
" trainer.save_checkpoint('/data/enformer/torch_weights/human.ckpt')"
|
| 874 |
+
]
|
| 875 |
+
},
|
| 876 |
+
{
|
| 877 |
+
"cell_type": "markdown",
|
| 878 |
+
"id": "5dd274ed-ba38-41ab-8288-ca848c33b539",
|
| 879 |
+
"metadata": {},
|
| 880 |
+
"source": [
|
| 881 |
+
"## Upload"
|
| 882 |
+
]
|
| 883 |
+
},
|
| 884 |
+
{
|
| 885 |
+
"cell_type": "code",
|
| 886 |
+
"execution_count": 31,
|
| 887 |
+
"id": "4f460b22-9da7-40f8-bd77-0c330a2b1111",
|
| 888 |
+
"metadata": {},
|
| 889 |
+
"outputs": [
|
| 890 |
+
{
|
| 891 |
+
"name": "stderr",
|
| 892 |
+
"output_type": "stream",
|
| 893 |
+
"text": [
|
| 894 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n",
|
| 895 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n",
|
| 896 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n",
|
| 897 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n",
|
| 898 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n",
|
| 899 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 277336 bytes\n"
|
| 900 |
+
]
|
| 901 |
+
},
|
| 902 |
+
{
|
| 903 |
+
"data": {
|
| 904 |
+
"text/plain": [
|
| 905 |
+
"<Artifact human>"
|
| 906 |
+
]
|
| 907 |
+
},
|
| 908 |
+
"execution_count": 31,
|
| 909 |
+
"metadata": {},
|
| 910 |
+
"output_type": "execute_result"
|
| 911 |
+
}
|
| 912 |
+
],
|
| 913 |
+
"source": [
|
| 914 |
+
"artifact = wandb.Artifact(\n",
|
| 915 |
+
" 'human', \n",
|
| 916 |
+
" type='model',\n",
|
| 917 |
+
" metadata={\n",
|
| 918 |
+
" 'model_params':model.model_params, \n",
|
| 919 |
+
" 'train_params':model.train_params, \n",
|
| 920 |
+
" 'data_params':model.data_params\n",
|
| 921 |
+
" }\n",
|
| 922 |
+
")\n",
|
| 923 |
+
"artifact.add_file(local_path='/data/enformer/torch_weights/human.ckpt', name='model.ckpt')\n",
|
| 924 |
+
"run.log_artifact(artifact)"
|
| 925 |
+
]
|
| 926 |
+
},
|
| 927 |
+
{
|
| 928 |
+
"cell_type": "code",
|
| 929 |
+
"execution_count": 32,
|
| 930 |
+
"id": "9269df1a-b6d4-410d-bbe6-39a4b70576cc",
|
| 931 |
+
"metadata": {},
|
| 932 |
+
"outputs": [
|
| 933 |
+
{
|
| 934 |
+
"data": {
|
| 935 |
+
"text/html": [],
|
| 936 |
+
"text/plain": [
|
| 937 |
+
"<IPython.core.display.HTML object>"
|
| 938 |
+
]
|
| 939 |
+
},
|
| 940 |
+
"metadata": {},
|
| 941 |
+
"output_type": "display_data"
|
| 942 |
+
},
|
| 943 |
+
{
|
| 944 |
+
"data": {
|
| 945 |
+
"text/html": [
|
| 946 |
+
" View run <strong style=\"color:#cdcd00\">copy-human</strong> at: <a href='https://wandb.ai/grelu/enformer/runs/rwin6f0o' target=\"_blank\">https://wandb.ai/grelu/enformer/runs/rwin6f0o</a><br> View project at: <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">https://wandb.ai/grelu/enformer</a><br>Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)"
|
| 947 |
+
],
|
| 948 |
+
"text/plain": [
|
| 949 |
+
"<IPython.core.display.HTML object>"
|
| 950 |
+
]
|
| 951 |
+
},
|
| 952 |
+
"metadata": {},
|
| 953 |
+
"output_type": "display_data"
|
| 954 |
+
},
|
| 955 |
+
{
|
| 956 |
+
"data": {
|
| 957 |
+
"text/html": [
|
| 958 |
+
"Find logs at: <code>./wandb/run-20250304_222603-rwin6f0o/logs</code>"
|
| 959 |
+
],
|
| 960 |
+
"text/plain": [
|
| 961 |
+
"<IPython.core.display.HTML object>"
|
| 962 |
+
]
|
| 963 |
+
},
|
| 964 |
+
"metadata": {},
|
| 965 |
+
"output_type": "display_data"
|
| 966 |
+
}
|
| 967 |
+
],
|
| 968 |
+
"source": [
|
| 969 |
+
"run.finish() "
|
| 970 |
+
]
|
| 971 |
+
},
|
| 972 |
+
{
|
| 973 |
+
"cell_type": "code",
|
| 974 |
+
"execution_count": null,
|
| 975 |
+
"id": "55eef443-2d99-443f-a31a-d1aee4a31afe",
|
| 976 |
+
"metadata": {},
|
| 977 |
+
"outputs": [],
|
| 978 |
+
"source": []
|
| 979 |
+
}
|
| 980 |
+
],
|
| 981 |
+
"metadata": {
|
| 982 |
+
"kernelspec": {
|
| 983 |
+
"display_name": "Python 3 (ipykernel)",
|
| 984 |
+
"language": "python",
|
| 985 |
+
"name": "python3"
|
| 986 |
+
},
|
| 987 |
+
"language_info": {
|
| 988 |
+
"codemirror_mode": {
|
| 989 |
+
"name": "ipython",
|
| 990 |
+
"version": 3
|
| 991 |
+
},
|
| 992 |
+
"file_extension": ".py",
|
| 993 |
+
"mimetype": "text/x-python",
|
| 994 |
+
"name": "python",
|
| 995 |
+
"nbconvert_exporter": "python",
|
| 996 |
+
"pygments_lexer": "ipython3",
|
| 997 |
+
"version": "3.11.9"
|
| 998 |
+
}
|
| 999 |
+
},
|
| 1000 |
+
"nbformat": 4,
|
| 1001 |
+
"nbformat_minor": 5
|
| 1002 |
+
}
|
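The human notebook above sets `seq_len` to 196608, `bin_size` to 128, `crop_len` to 320 and `label_len` to `896 * 128`, after resizing intervals from 131072 bp to 196608 bp. The following is a minimal sketch of how those numbers fit together. It assumes that `crop_len` counts output bins trimmed from each end and that resizing keeps the interval center fixed; `output_bins` and `resize_centered` are hypothetical helpers written for this illustration and are not gReLU functions.

```python
# Values taken from the notebook cells above.
SEQ_LEN = 196_608   # input sequence length in bp
BIN_SIZE = 128      # output resolution in bp per bin
CROP_LEN = 320      # assumption: bins cropped from EACH end of the output
ORIG_LEN = 131_072  # interval length before resizing


def output_bins(seq_len: int, bin_size: int, crop_len: int) -> int:
    """Number of output bins left after cropping both ends (hypothetical helper)."""
    return seq_len // bin_size - 2 * crop_len


def resize_centered(start: int, end: int, new_len: int) -> tuple[int, int]:
    """Resize an interval around its midpoint (assumed behaviour, not gReLU's code)."""
    center = (start + end) // 2
    new_start = center - new_len // 2
    return new_start, new_start + new_len


n_bins = output_bins(SEQ_LEN, BIN_SIZE, CROP_LEN)
label_len = n_bins * BIN_SIZE
print(n_bins, label_len)  # 896 114688

# First row of the resized table in the notebook output: chr18 895618 1092226.
wide = (895_618, 1_092_226)
narrow = resize_centered(*wide, ORIG_LEN)      # shrink back to 131072 bp
restored = resize_centered(*narrow, SEQ_LEN)   # grow again to 196608 bp
print(narrow, restored)
```

Under these assumptions the label window (114688 bp) is the central part of the 196608 bp input, which is consistent with the `label_len` value stored in `data_params`.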
save_wandb_enformer_mouse.ipynb
ADDED
|
@@ -0,0 +1,979 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": 1,
|
| 6 |
+
"id": "0b100814-a834-4c18-abbe-9cab6ea1278c",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [
|
| 9 |
+
{
|
| 10 |
+
"name": "stderr",
|
| 11 |
+
"output_type": "stream",
|
| 12 |
+
"text": [
|
| 13 |
+
"/opt/conda/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 14 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 15 |
+
]
|
| 16 |
+
}
|
| 17 |
+
],
|
| 18 |
+
"source": [
|
| 19 |
+
"import wandb\n",
|
| 20 |
+
"import torch\n",
|
| 21 |
+
"import pandas as pd\n",
|
| 22 |
+
"\n",
|
| 23 |
+
"from grelu.lightning import LightningModel\n",
|
| 24 |
+
"import pytorch_lightning as pl\n",
|
| 25 |
+
"from grelu.sequence.utils import get_unique_length, resize"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "cb22e3f0-8ef7-41f3-aefc-4a2d182af5ba",
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"source": [
|
| 33 |
+
"## wandb login"
|
| 34 |
+
]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"cell_type": "code",
|
| 38 |
+
"execution_count": 2,
|
| 39 |
+
"id": "137c5253-351e-4945-88a3-a4b7c555326c",
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"outputs": [
|
| 42 |
+
{
|
| 43 |
+
"name": "stderr",
|
| 44 |
+
"output_type": "stream",
|
| 45 |
+
"text": [
|
| 46 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.\n",
|
| 47 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mavantikalal\u001b[0m (\u001b[33mgrelu\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n"
|
| 48 |
+
]
|
| 49 |
+
},
|
| 50 |
+
{
|
| 51 |
+
"data": {
|
| 52 |
+
"text/plain": [
|
| 53 |
+
"True"
|
| 54 |
+
]
|
| 55 |
+
},
|
| 56 |
+
"execution_count": 2,
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"output_type": "execute_result"
|
| 59 |
+
}
|
| 60 |
+
],
|
| 61 |
+
"source": [
|
| 62 |
+
"wandb.login(host=\"https://api.wandb.ai\")"
|
| 63 |
+
]
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"cell_type": "code",
|
| 67 |
+
"execution_count": 3,
|
| 68 |
+
"id": "b5be7b96-5f3c-4073-b152-22ae3e46db06",
|
| 69 |
+
"metadata": {},
|
| 70 |
+
"outputs": [
|
| 71 |
+
{
|
| 72 |
+
"data": {
|
| 73 |
+
"text/html": [
|
| 74 |
+
"Tracking run with wandb version 0.19.7"
|
| 75 |
+
],
|
| 76 |
+
"text/plain": [
|
| 77 |
+
"<IPython.core.display.HTML object>"
|
| 78 |
+
]
|
| 79 |
+
},
|
| 80 |
+
"metadata": {},
|
| 81 |
+
"output_type": "display_data"
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"data": {
|
| 85 |
+
"text/html": [
|
| 86 |
+
"Run data is saved locally in <code>/code/github/gReLU-applications/enformer/wandb/run-20250304_222920-jrgxfvad</code>"
|
| 87 |
+
],
|
| 88 |
+
"text/plain": [
|
| 89 |
+
"<IPython.core.display.HTML object>"
|
| 90 |
+
]
|
| 91 |
+
},
|
| 92 |
+
"metadata": {},
|
| 93 |
+
"output_type": "display_data"
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"data": {
|
| 97 |
+
"text/html": [
|
| 98 |
+
"Syncing run <strong><a href='https://wandb.ai/grelu/enformer/runs/jrgxfvad' target=\"_blank\">copy-mouse</a></strong> to <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">Weights & Biases</a> (<a href='https://wandb.me/developer-guide' target=\"_blank\">docs</a>)<br>"
|
| 99 |
+
],
|
| 100 |
+
"text/plain": [
|
| 101 |
+
"<IPython.core.display.HTML object>"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
"metadata": {},
|
| 105 |
+
"output_type": "display_data"
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"data": {
|
| 109 |
+
"text/html": [
|
| 110 |
+
" View project at <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">https://wandb.ai/grelu/enformer</a>"
|
| 111 |
+
],
|
| 112 |
+
"text/plain": [
|
| 113 |
+
"<IPython.core.display.HTML object>"
|
| 114 |
+
]
|
| 115 |
+
},
|
| 116 |
+
"metadata": {},
|
| 117 |
+
"output_type": "display_data"
|
| 118 |
+
},
|
| 119 |
+
{
|
| 120 |
+
"data": {
|
| 121 |
+
"text/html": [
|
| 122 |
+
" View run at <a href='https://wandb.ai/grelu/enformer/runs/jrgxfvad' target=\"_blank\">https://wandb.ai/grelu/enformer/runs/jrgxfvad</a>"
|
| 123 |
+
],
|
| 124 |
+
"text/plain": [
|
| 125 |
+
"<IPython.core.display.HTML object>"
|
| 126 |
+
]
|
| 127 |
+
},
|
| 128 |
+
"metadata": {},
|
| 129 |
+
"output_type": "display_data"
|
| 130 |
+
}
|
| 131 |
+
],
|
| 132 |
+
"source": [
|
| 133 |
+
"run = wandb.init(entity='grelu', project='enformer', job_type='copy', name='copy-mouse') # Initialize a W&B Run"
|
| 134 |
+
]
|
| 135 |
+
},
|
| 136 |
+
{
|
| 137 |
+
"cell_type": "markdown",
|
| 138 |
+
"id": "4564271b-2a71-47e3-bfc4-9b0f4c616a73",
|
| 139 |
+
"metadata": {},
|
| 140 |
+
"source": [
|
| 141 |
+
"## Paths"
|
| 142 |
+
]
|
| 143 |
+
},
|
| 144 |
+
{
|
| 145 |
+
"cell_type": "code",
|
| 146 |
+
"execution_count": 4,
|
| 147 |
+
"id": "86052145-d3a9-4f6a-9b23-8353b5dd38e1",
|
| 148 |
+
"metadata": {},
|
| 149 |
+
"outputs": [],
|
| 150 |
+
"source": [
|
| 151 |
+
"targets_path = 'https://raw.githubusercontent.com/calico/basenji/master/manuscripts/cross2020/targets_mouse.txt'"
|
| 152 |
+
]
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"cell_type": "code",
|
| 156 |
+
"execution_count": 5,
|
| 157 |
+
"id": "97aed0b6-8b61-424a-8366-37c204b061d8",
|
| 158 |
+
"metadata": {},
|
| 159 |
+
"outputs": [],
|
| 160 |
+
"source": [
|
| 161 |
+
"sequences_path = '/gstore/data/resbioai/grelu/enformer/sequences-mouse.bed'"
|
| 162 |
+
]
|
| 163 |
+
},
|
| 164 |
+
{
|
| 165 |
+
"cell_type": "markdown",
|
| 166 |
+
"id": "915c74f7-15de-4ae6-a9c9-0142fa885f1e",
|
| 167 |
+
"metadata": {},
|
| 168 |
+
"source": [
|
| 169 |
+
"## Process tasks"
|
| 170 |
+
]
|
| 171 |
+
},
|
| 172 |
+
{
|
| 173 |
+
"cell_type": "code",
|
| 174 |
+
"execution_count": 6,
|
| 175 |
+
"id": "6701b16c-e750-4eb6-b2e0-1223d49cc103",
|
| 176 |
+
"metadata": {},
|
| 177 |
+
"outputs": [
|
| 178 |
+
{
|
| 179 |
+
"name": "stdout",
|
| 180 |
+
"output_type": "stream",
|
| 181 |
+
"text": [
|
| 182 |
+
"1643\n"
|
| 183 |
+
]
|
| 184 |
+
},
|
| 185 |
+
{
|
| 186 |
+
"data": {
|
| 187 |
+
"text/html": [
|
| 188 |
+
"<div>\n",
|
| 189 |
+
"<style scoped>\n",
|
| 190 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 191 |
+
" vertical-align: middle;\n",
|
| 192 |
+
" }\n",
|
| 193 |
+
"\n",
|
| 194 |
+
" .dataframe tbody tr th {\n",
|
| 195 |
+
" vertical-align: top;\n",
|
| 196 |
+
" }\n",
|
| 197 |
+
"\n",
|
| 198 |
+
" .dataframe thead th {\n",
|
| 199 |
+
" text-align: right;\n",
|
| 200 |
+
" }\n",
|
| 201 |
+
"</style>\n",
|
| 202 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 203 |
+
" <thead>\n",
|
| 204 |
+
" <tr style=\"text-align: right;\">\n",
|
| 205 |
+
" <th></th>\n",
|
| 206 |
+
" <th>genome</th>\n",
|
| 207 |
+
" <th>identifier</th>\n",
|
| 208 |
+
" <th>file</th>\n",
|
| 209 |
+
" <th>clip</th>\n",
|
| 210 |
+
" <th>scale</th>\n",
|
| 211 |
+
" <th>sum_stat</th>\n",
|
| 212 |
+
" <th>description</th>\n",
|
| 213 |
+
" </tr>\n",
|
| 214 |
+
" <tr>\n",
|
| 215 |
+
" <th>index</th>\n",
|
| 216 |
+
" <th></th>\n",
|
| 217 |
+
" <th></th>\n",
|
| 218 |
+
" <th></th>\n",
|
| 219 |
+
" <th></th>\n",
|
| 220 |
+
" <th></th>\n",
|
| 221 |
+
" <th></th>\n",
|
| 222 |
+
" <th></th>\n",
|
| 223 |
+
" </tr>\n",
|
| 224 |
+
" </thead>\n",
|
| 225 |
+
" <tbody>\n",
|
| 226 |
+
" <tr>\n",
|
| 227 |
+
" <th>5313</th>\n",
|
| 228 |
+
" <td>1</td>\n",
|
| 229 |
+
" <td>ENCFF866ZTV</td>\n",
|
| 230 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 231 |
+
" <td>32</td>\n",
|
| 232 |
+
" <td>2</td>\n",
|
| 233 |
+
" <td>mean</td>\n",
|
| 234 |
+
" <td>DNASE:B6D2F1/J 416B</td>\n",
|
| 235 |
+
" </tr>\n",
|
| 236 |
+
" <tr>\n",
|
| 237 |
+
" <th>5314</th>\n",
|
| 238 |
+
" <td>1</td>\n",
|
| 239 |
+
" <td>ENCFF695LHM</td>\n",
|
| 240 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 241 |
+
" <td>32</td>\n",
|
| 242 |
+
" <td>2</td>\n",
|
| 243 |
+
" <td>mean</td>\n",
|
| 244 |
+
" <td>DNASE:BALB/cAnN A20</td>\n",
|
| 245 |
+
" </tr>\n",
|
| 246 |
+
" <tr>\n",
|
| 247 |
+
" <th>5315</th>\n",
|
| 248 |
+
" <td>1</td>\n",
|
| 249 |
+
" <td>ENCFF079SPZ</td>\n",
|
| 250 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 251 |
+
" <td>32</td>\n",
|
| 252 |
+
" <td>2</td>\n",
|
| 253 |
+
" <td>mean</td>\n",
|
| 254 |
+
" <td>DNASE:C57BL/6 B cell male adult (8 weeks)</td>\n",
|
| 255 |
+
" </tr>\n",
|
| 256 |
+
" </tbody>\n",
|
| 257 |
+
"</table>\n",
|
| 258 |
+
"</div>"
|
| 259 |
+
],
|
| 260 |
+
"text/plain": [
|
| 261 |
+
" genome identifier file \\\n",
|
| 262 |
+
"index \n",
|
| 263 |
+
"5313 1 ENCFF866ZTV /home/drk/tillage/datasets/mouse/dnase/encode/... \n",
|
| 264 |
+
"5314 1 ENCFF695LHM /home/drk/tillage/datasets/mouse/dnase/encode/... \n",
|
| 265 |
+
"5315 1 ENCFF079SPZ /home/drk/tillage/datasets/mouse/dnase/encode/... \n",
|
| 266 |
+
"\n",
|
| 267 |
+
" clip scale sum_stat description \n",
|
| 268 |
+
"index \n",
|
| 269 |
+
"5313 32 2 mean DNASE:B6D2F1/J 416B \n",
|
| 270 |
+
"5314 32 2 mean DNASE:BALB/cAnN A20 \n",
|
| 271 |
+
"5315 32 2 mean DNASE:C57BL/6 B cell male adult (8 weeks) "
|
| 272 |
+
]
|
| 273 |
+
},
|
| 274 |
+
"execution_count": 6,
|
| 275 |
+
"metadata": {},
|
| 276 |
+
"output_type": "execute_result"
|
| 277 |
+
}
|
| 278 |
+
],
|
| 279 |
+
"source": [
|
| 280 |
+
"tasks = pd.read_csv(targets_path, sep='\\t', index_col=0)\n",
|
| 281 |
+
"print(len(tasks))\n",
|
| 282 |
+
"tasks.head(3)"
|
| 283 |
+
]
|
| 284 |
+
},
|
| 285 |
+
{
|
| 286 |
+
"cell_type": "code",
|
| 287 |
+
"execution_count": 7,
|
| 288 |
+
"id": "cc958ef5-f033-43df-b091-22e03021dd48",
|
| 289 |
+
"metadata": {},
|
| 290 |
+
"outputs": [
|
| 291 |
+
{
|
| 292 |
+
"data": {
|
| 293 |
+
"text/html": [
|
| 294 |
+
"<div>\n",
|
| 295 |
+
"<style scoped>\n",
|
| 296 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 297 |
+
" vertical-align: middle;\n",
|
| 298 |
+
" }\n",
|
| 299 |
+
"\n",
|
| 300 |
+
" .dataframe tbody tr th {\n",
|
| 301 |
+
" vertical-align: top;\n",
|
| 302 |
+
" }\n",
|
| 303 |
+
"\n",
|
| 304 |
+
" .dataframe thead th {\n",
|
| 305 |
+
" text-align: right;\n",
|
| 306 |
+
" }\n",
|
| 307 |
+
"</style>\n",
|
| 308 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 309 |
+
" <thead>\n",
|
| 310 |
+
" <tr style=\"text-align: right;\">\n",
|
| 311 |
+
" <th></th>\n",
|
| 312 |
+
" <th>name</th>\n",
|
| 313 |
+
" <th>file</th>\n",
|
| 314 |
+
" <th>clip</th>\n",
|
| 315 |
+
" <th>scale</th>\n",
|
| 316 |
+
" <th>sum_stat</th>\n",
|
| 317 |
+
" <th>description</th>\n",
|
| 318 |
+
" <th>assay</th>\n",
|
| 319 |
+
" <th>sample</th>\n",
|
| 320 |
+
" </tr>\n",
|
| 321 |
+
" </thead>\n",
|
| 322 |
+
" <tbody>\n",
|
| 323 |
+
" <tr>\n",
|
| 324 |
+
" <th>0</th>\n",
|
| 325 |
+
" <td>ENCFF866ZTV</td>\n",
|
| 326 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 327 |
+
" <td>32</td>\n",
|
| 328 |
+
" <td>2</td>\n",
|
| 329 |
+
" <td>mean</td>\n",
|
| 330 |
+
" <td>DNASE:B6D2F1/J 416B</td>\n",
|
| 331 |
+
" <td>DNASE</td>\n",
|
| 332 |
+
" <td>B6D2F1/J 416B</td>\n",
|
| 333 |
+
" </tr>\n",
|
| 334 |
+
" <tr>\n",
|
| 335 |
+
" <th>1</th>\n",
|
| 336 |
+
" <td>ENCFF695LHM</td>\n",
|
| 337 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 338 |
+
" <td>32</td>\n",
|
| 339 |
+
" <td>2</td>\n",
|
| 340 |
+
" <td>mean</td>\n",
|
| 341 |
+
" <td>DNASE:BALB/cAnN A20</td>\n",
|
| 342 |
+
" <td>DNASE</td>\n",
|
| 343 |
+
" <td>BALB/cAnN A20</td>\n",
|
| 344 |
+
" </tr>\n",
|
| 345 |
+
" <tr>\n",
|
| 346 |
+
" <th>2</th>\n",
|
| 347 |
+
" <td>ENCFF079SPZ</td>\n",
|
| 348 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 349 |
+
" <td>32</td>\n",
|
| 350 |
+
" <td>2</td>\n",
|
| 351 |
+
" <td>mean</td>\n",
|
| 352 |
+
" <td>DNASE:C57BL/6 B cell male adult (8 weeks)</td>\n",
|
| 353 |
+
" <td>DNASE</td>\n",
|
| 354 |
+
" <td>C57BL/6 B cell male adult (8 weeks)</td>\n",
|
| 355 |
+
" </tr>\n",
|
| 356 |
+
" <tr>\n",
|
| 357 |
+
" <th>3</th>\n",
|
| 358 |
+
" <td>ENCFF798VSP</td>\n",
|
| 359 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 360 |
+
" <td>32</td>\n",
|
| 361 |
+
" <td>2</td>\n",
|
| 362 |
+
" <td>mean</td>\n",
|
| 363 |
+
" <td>DNASE:C57BL/6 splenic B cell male adult (8 weeks)</td>\n",
|
| 364 |
+
" <td>DNASE</td>\n",
|
| 365 |
+
" <td>C57BL/6 splenic B cell male adult (8 weeks)</td>\n",
|
| 366 |
+
" </tr>\n",
|
| 367 |
+
" <tr>\n",
|
| 368 |
+
" <th>4</th>\n",
|
| 369 |
+
" <td>ENCFF474GND</td>\n",
|
| 370 |
+
" <td>/home/drk/tillage/datasets/mouse/dnase/encode/...</td>\n",
|
| 371 |
+
" <td>32</td>\n",
|
| 372 |
+
" <td>2</td>\n",
|
| 373 |
+
" <td>mean</td>\n",
|
| 374 |
+
" <td>DNASE:C57BL/6 cerebellum male adult (8 weeks)</td>\n",
|
| 375 |
+
" <td>DNASE</td>\n",
|
| 376 |
+
" <td>C57BL/6 cerebellum male adult (8 weeks)</td>\n",
|
| 377 |
+
" </tr>\n",
|
| 378 |
+
" </tbody>\n",
|
| 379 |
+
"</table>\n",
|
| 380 |
+
"</div>"
|
| 381 |
+
],
|
| 382 |
+
"text/plain": [
|
| 383 |
+
" name file clip \\\n",
|
| 384 |
+
"0 ENCFF866ZTV /home/drk/tillage/datasets/mouse/dnase/encode/... 32 \n",
|
| 385 |
+
"1 ENCFF695LHM /home/drk/tillage/datasets/mouse/dnase/encode/... 32 \n",
|
| 386 |
+
"2 ENCFF079SPZ /home/drk/tillage/datasets/mouse/dnase/encode/... 32 \n",
|
| 387 |
+
"3 ENCFF798VSP /home/drk/tillage/datasets/mouse/dnase/encode/... 32 \n",
|
| 388 |
+
"4 ENCFF474GND /home/drk/tillage/datasets/mouse/dnase/encode/... 32 \n",
|
| 389 |
+
"\n",
|
| 390 |
+
" scale sum_stat description assay \\\n",
|
| 391 |
+
"0 2 mean DNASE:B6D2F1/J 416B DNASE \n",
|
| 392 |
+
"1 2 mean DNASE:BALB/cAnN A20 DNASE \n",
|
| 393 |
+
"2 2 mean DNASE:C57BL/6 B cell male adult (8 weeks) DNASE \n",
|
| 394 |
+
"3 2 mean DNASE:C57BL/6 splenic B cell male adult (8 weeks) DNASE \n",
|
| 395 |
+
"4 2 mean DNASE:C57BL/6 cerebellum male adult (8 weeks) DNASE \n",
|
| 396 |
+
"\n",
|
| 397 |
+
" sample \n",
|
| 398 |
+
"0 B6D2F1/J 416B \n",
|
| 399 |
+
"1 BALB/cAnN A20 \n",
|
| 400 |
+
"2 C57BL/6 B cell male adult (8 weeks) \n",
|
| 401 |
+
"3 C57BL/6 splenic B cell male adult (8 weeks) \n",
|
| 402 |
+
"4 C57BL/6 cerebellum male adult (8 weeks) "
|
| 403 |
+
]
|
| 404 |
+
},
|
| 405 |
+
"execution_count": 7,
|
| 406 |
+
"metadata": {},
|
| 407 |
+
"output_type": "execute_result"
|
| 408 |
+
}
|
| 409 |
+
],
|
| 410 |
+
"source": [
|
| 411 |
+
"tasks = tasks.reset_index(drop=True)\n",
|
| 412 |
+
"tasks = tasks.drop(columns=[\"genome\"])\n",
|
| 413 |
+
"tasks[\"assay\"] = tasks[\"description\"].apply(lambda x: x.split(\":\")[0])\n",
|
| 414 |
+
"tasks[\"sample\"] = tasks[\"description\"].apply(lambda x: \":\".join(x.split(\":\")[1:]))\n",
|
| 415 |
+
"tasks = tasks.rename(columns={\"identifier\":\"name\"})\n",
|
| 416 |
+
"tasks.head()"
|
| 417 |
+
]
|
| 418 |
+
},
|
| 419 |
+
{
|
| 420 |
+
"cell_type": "code",
|
| 421 |
+
"execution_count": 8,
|
| 422 |
+
"id": "d914129b-4dcd-4a11-9a8e-f99618200024",
|
| 423 |
+
"metadata": {},
|
| 424 |
+
"outputs": [],
|
| 425 |
+
"source": [
|
| 426 |
+
"tasks = tasks.to_dict(orient=\"list\")"
|
| 427 |
+
]
|
| 428 |
+
},
|
| 429 |
+
{
|
| 430 |
+
"cell_type": "markdown",
|
| 431 |
+
"id": "0caaa50e-a1dd-4d2a-ab17-06e545cef485",
|
| 432 |
+
"metadata": {},
|
| 433 |
+
"source": [
|
| 434 |
+
"## Process intervals"
|
| 435 |
+
]
|
| 436 |
+
},
|
| 437 |
+
{
|
| 438 |
+
"cell_type": "code",
|
| 439 |
+
"execution_count": 9,
|
| 440 |
+
"id": "456bf351-55aa-426d-a10d-0edc9b3591f4",
|
| 441 |
+
"metadata": {},
|
| 442 |
+
"outputs": [
|
| 443 |
+
{
|
| 444 |
+
"data": {
|
| 445 |
+
"text/html": [
|
| 446 |
+
"<div>\n",
|
| 447 |
+
"<style scoped>\n",
|
| 448 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 449 |
+
" vertical-align: middle;\n",
|
| 450 |
+
" }\n",
|
| 451 |
+
"\n",
|
| 452 |
+
" .dataframe tbody tr th {\n",
|
| 453 |
+
" vertical-align: top;\n",
|
| 454 |
+
" }\n",
|
| 455 |
+
"\n",
|
| 456 |
+
" .dataframe thead th {\n",
|
| 457 |
+
" text-align: right;\n",
|
| 458 |
+
" }\n",
|
| 459 |
+
"</style>\n",
|
| 460 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 461 |
+
" <thead>\n",
|
| 462 |
+
" <tr style=\"text-align: right;\">\n",
|
| 463 |
+
" <th></th>\n",
|
| 464 |
+
" <th>chrom</th>\n",
|
| 465 |
+
" <th>start</th>\n",
|
| 466 |
+
" <th>end</th>\n",
|
| 467 |
+
" <th>split</th>\n",
|
| 468 |
+
" </tr>\n",
|
| 469 |
+
" </thead>\n",
|
| 470 |
+
" <tbody>\n",
|
| 471 |
+
" <tr>\n",
|
| 472 |
+
" <th>0</th>\n",
|
| 473 |
+
" <td>chr4</td>\n",
|
| 474 |
+
" <td>34106647</td>\n",
|
| 475 |
+
" <td>34237719</td>\n",
|
| 476 |
+
" <td>train</td>\n",
|
| 477 |
+
" </tr>\n",
|
| 478 |
+
" <tr>\n",
|
| 479 |
+
" <th>1</th>\n",
|
| 480 |
+
" <td>chr5</td>\n",
|
| 481 |
+
" <td>52207747</td>\n",
|
| 482 |
+
" <td>52338819</td>\n",
|
| 483 |
+
" <td>train</td>\n",
|
| 484 |
+
" </tr>\n",
|
| 485 |
+
" <tr>\n",
|
| 486 |
+
" <th>2</th>\n",
|
| 487 |
+
" <td>chr19</td>\n",
|
| 488 |
+
" <td>20136862</td>\n",
|
| 489 |
+
" <td>20267934</td>\n",
|
| 490 |
+
" <td>train</td>\n",
|
| 491 |
+
" </tr>\n",
|
| 492 |
+
" <tr>\n",
|
| 493 |
+
" <th>3</th>\n",
|
| 494 |
+
" <td>chr14</td>\n",
|
| 495 |
+
" <td>61845439</td>\n",
|
| 496 |
+
" <td>61976511</td>\n",
|
| 497 |
+
" <td>train</td>\n",
|
| 498 |
+
" </tr>\n",
|
| 499 |
+
" <tr>\n",
|
| 500 |
+
" <th>4</th>\n",
|
| 501 |
+
" <td>chr15</td>\n",
|
| 502 |
+
" <td>6592346</td>\n",
|
| 503 |
+
" <td>6723418</td>\n",
|
| 504 |
+
" <td>train</td>\n",
|
| 505 |
+
" </tr>\n",
|
| 506 |
+
" </tbody>\n",
|
| 507 |
+
"</table>\n",
|
| 508 |
+
"</div>"
|
| 509 |
+
],
|
| 510 |
+
"text/plain": [
|
| 511 |
+
" chrom start end split\n",
|
| 512 |
+
"0 chr4 34106647 34237719 train\n",
|
| 513 |
+
"1 chr5 52207747 52338819 train\n",
|
| 514 |
+
"2 chr19 20136862 20267934 train\n",
|
| 515 |
+
"3 chr14 61845439 61976511 train\n",
|
| 516 |
+
"4 chr15 6592346 6723418 train"
|
| 517 |
+
]
|
| 518 |
+
},
|
| 519 |
+
"execution_count": 9,
|
| 520 |
+
"metadata": {},
|
| 521 |
+
"output_type": "execute_result"
|
| 522 |
+
}
|
| 523 |
+
],
|
| 524 |
+
"source": [
|
| 525 |
+
"intervals = pd.read_table(sequences_path, header=None, names = ['chrom', 'start', 'end', 'split'])\n",
|
| 526 |
+
"intervals.head()"
|
| 527 |
+
]
|
| 528 |
+
},
|
| 529 |
+
{
|
| 530 |
+
"cell_type": "code",
|
| 531 |
+
"execution_count": 10,
|
| 532 |
+
"id": "237d58cd-4250-46de-af72-9fd5a481657e",
|
| 533 |
+
"metadata": {},
|
| 534 |
+
"outputs": [
|
| 535 |
+
{
|
| 536 |
+
"data": {
|
| 537 |
+
"text/plain": [
|
| 538 |
+
"split\n",
|
| 539 |
+
"train 29295\n",
|
| 540 |
+
"valid 2209\n",
|
| 541 |
+
"test 2017\n",
|
| 542 |
+
"Name: count, dtype: int64"
|
| 543 |
+
]
|
| 544 |
+
},
|
| 545 |
+
"execution_count": 10,
|
| 546 |
+
"metadata": {},
|
| 547 |
+
"output_type": "execute_result"
|
| 548 |
+
}
|
| 549 |
+
],
|
| 550 |
+
"source": [
|
| 551 |
+
"intervals.split.value_counts()"
|
| 552 |
+
]
|
| 553 |
+
},
|
| 554 |
+
{
|
| 555 |
+
"cell_type": "code",
|
| 556 |
+
"execution_count": 11,
|
| 557 |
+
"id": "8fb0c5d9-fc33-498f-b026-a9ac93b59418",
|
| 558 |
+
"metadata": {},
|
| 559 |
+
"outputs": [
|
| 560 |
+
{
|
| 561 |
+
"data": {
|
| 562 |
+
"text/plain": [
|
| 563 |
+
"131072"
|
| 564 |
+
]
|
| 565 |
+
},
|
| 566 |
+
"execution_count": 11,
|
| 567 |
+
"metadata": {},
|
| 568 |
+
"output_type": "execute_result"
|
| 569 |
+
}
|
| 570 |
+
],
|
| 571 |
+
"source": [
|
| 572 |
+
"get_unique_length(intervals)"
|
| 573 |
+
]
|
| 574 |
+
},
|
| 575 |
+
{
|
| 576 |
+
"cell_type": "code",
|
| 577 |
+
"execution_count": 12,
|
| 578 |
+
"id": "a1633ea5-82a6-45d7-b28e-9ed1cef59c24",
|
| 579 |
+
"metadata": {},
|
| 580 |
+
"outputs": [
|
| 581 |
+
{
|
| 582 |
+
"data": {
|
| 583 |
+
"text/html": [
|
| 584 |
+
"<div>\n",
|
| 585 |
+
"<style scoped>\n",
|
| 586 |
+
" .dataframe tbody tr th:only-of-type {\n",
|
| 587 |
+
" vertical-align: middle;\n",
|
| 588 |
+
" }\n",
|
| 589 |
+
"\n",
|
| 590 |
+
" .dataframe tbody tr th {\n",
|
| 591 |
+
" vertical-align: top;\n",
|
| 592 |
+
" }\n",
|
| 593 |
+
"\n",
|
| 594 |
+
" .dataframe thead th {\n",
|
| 595 |
+
" text-align: right;\n",
|
| 596 |
+
" }\n",
|
| 597 |
+
"</style>\n",
|
| 598 |
+
"<table border=\"1\" class=\"dataframe\">\n",
|
| 599 |
+
" <thead>\n",
|
| 600 |
+
" <tr style=\"text-align: right;\">\n",
|
| 601 |
+
" <th></th>\n",
|
| 602 |
+
" <th>chrom</th>\n",
|
| 603 |
+
" <th>start</th>\n",
|
| 604 |
+
" <th>end</th>\n",
|
| 605 |
+
" <th>split</th>\n",
|
| 606 |
+
" </tr>\n",
|
| 607 |
+
" </thead>\n",
|
| 608 |
+
" <tbody>\n",
|
| 609 |
+
" <tr>\n",
|
| 610 |
+
" <th>0</th>\n",
|
| 611 |
+
" <td>chr4</td>\n",
|
| 612 |
+
" <td>34073879</td>\n",
|
| 613 |
+
" <td>34270487</td>\n",
|
| 614 |
+
" <td>train</td>\n",
|
| 615 |
+
" </tr>\n",
|
| 616 |
+
" <tr>\n",
|
| 617 |
+
" <th>1</th>\n",
|
| 618 |
+
" <td>chr5</td>\n",
|
| 619 |
+
" <td>52174979</td>\n",
|
| 620 |
+
" <td>52371587</td>\n",
|
| 621 |
+
" <td>train</td>\n",
|
| 622 |
+
" </tr>\n",
|
| 623 |
+
" <tr>\n",
|
| 624 |
+
" <th>2</th>\n",
|
| 625 |
+
" <td>chr19</td>\n",
|
| 626 |
+
" <td>20104094</td>\n",
|
| 627 |
+
" <td>20300702</td>\n",
|
| 628 |
+
" <td>train</td>\n",
|
| 629 |
+
" </tr>\n",
|
| 630 |
+
" <tr>\n",
|
| 631 |
+
" <th>3</th>\n",
|
| 632 |
+
" <td>chr14</td>\n",
|
| 633 |
+
" <td>61812671</td>\n",
|
| 634 |
+
" <td>62009279</td>\n",
|
| 635 |
+
" <td>train</td>\n",
|
| 636 |
+
" </tr>\n",
|
| 637 |
+
" <tr>\n",
|
| 638 |
+
" <th>4</th>\n",
|
| 639 |
+
" <td>chr15</td>\n",
|
| 640 |
+
" <td>6559578</td>\n",
|
| 641 |
+
" <td>6756186</td>\n",
|
| 642 |
+
" <td>train</td>\n",
|
| 643 |
+
" </tr>\n",
|
| 644 |
+
" </tbody>\n",
|
| 645 |
+
"</table>\n",
|
| 646 |
+
"</div>"
|
| 647 |
+
],
|
| 648 |
+
"text/plain": [
|
| 649 |
+
" chrom start end split\n",
|
| 650 |
+
"0 chr4 34073879 34270487 train\n",
|
| 651 |
+
"1 chr5 52174979 52371587 train\n",
|
| 652 |
+
"2 chr19 20104094 20300702 train\n",
|
| 653 |
+
"3 chr14 61812671 62009279 train\n",
|
| 654 |
+
"4 chr15 6559578 6756186 train"
|
| 655 |
+
]
|
| 656 |
+
},
|
| 657 |
+
"execution_count": 12,
|
| 658 |
+
"metadata": {},
|
| 659 |
+
"output_type": "execute_result"
|
| 660 |
+
}
|
| 661 |
+
],
|
| 662 |
+
"source": [
|
| 663 |
+
"intervals = resize(intervals, 196608)\n",
|
| 664 |
+
"intervals.head()"
|
| 665 |
+
]
|
| 666 |
+
},
|
| 667 |
+
{
|
| 668 |
+
"cell_type": "code",
|
| 669 |
+
"execution_count": 13,
|
| 670 |
+
"id": "75f1f860-dcd3-4e9f-8a2d-95ca95e63654",
|
| 671 |
+
"metadata": {},
|
| 672 |
+
"outputs": [],
|
| 673 |
+
"source": [
|
| 674 |
+
"train_intervals = intervals[intervals.split=='train'].iloc[:, :3]\n",
|
| 675 |
+
"val_intervals = intervals[intervals.split=='valid'].iloc[:, :3]\n",
|
| 676 |
+
"test_intervals = intervals[intervals.split=='test'].iloc[:, :3]\n",
|
| 677 |
+
"del intervals"
|
| 678 |
+
]
|
| 679 |
+
},
|
| 680 |
+
{
|
| 681 |
+
"cell_type": "markdown",
|
| 682 |
+
"id": "62ecac6c-f031-4269-9672-50f57b883368",
|
| 683 |
+
"metadata": {},
|
| 684 |
+
"source": [
|
| 685 |
+
"## Initialize model"
|
| 686 |
+
]
|
| 687 |
+
},
|
| 688 |
+
{
|
| 689 |
+
"cell_type": "code",
|
| 690 |
+
"execution_count": 14,
|
| 691 |
+
"id": "a5f3af07-c670-4ceb-abc1-891f45b298d0",
|
| 692 |
+
"metadata": {},
|
| 693 |
+
"outputs": [],
|
| 694 |
+
"source": [
|
| 695 |
+
"model_params={\n",
|
| 696 |
+
" 'model_type':'EnformerModel',\n",
|
| 697 |
+
" 'final_act_func': 'softplus',\n",
|
| 698 |
+
" 'final_pool_func':None,\n",
|
| 699 |
+
" 'n_tasks': 1643,\n",
|
| 700 |
+
" 'crop_len':320,\n",
|
| 701 |
+
"}\n",
|
| 702 |
+
"train_params={'task':'regression', 'loss':'mse'}\n",
|
| 703 |
+
"\n",
|
| 704 |
+
"model = LightningModel(model_params, train_params)"
|
| 705 |
+
]
|
| 706 |
+
},
|
| 707 |
+
{
|
| 708 |
+
"cell_type": "markdown",
|
| 709 |
+
"id": "4e6981d5-fb40-486a-bb34-db296ac14778",
|
| 710 |
+
"metadata": {},
|
| 711 |
+
"source": [
|
| 712 |
+
"## Load weights"
|
| 713 |
+
]
|
| 714 |
+
},
|
| 715 |
+
{
|
| 716 |
+
"cell_type": "code",
|
| 717 |
+
"execution_count": 15,
|
| 718 |
+
"id": "08cef7a5-3a9c-4342-8d82-2eabf8eac4d6",
|
| 719 |
+
"metadata": {},
|
| 720 |
+
"outputs": [
|
| 721 |
+
{
|
| 722 |
+
"name": "stderr",
|
| 723 |
+
"output_type": "stream",
|
| 724 |
+
"text": [
|
| 725 |
+
"/tmp/ipykernel_3297608/3379349373.py:1: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
|
| 726 |
+
" state_dict = torch.load(\"/data/enformer/torch_weights/mouse.h5\")\n"
|
| 727 |
+
]
|
| 728 |
+
},
|
| 729 |
+
{
|
| 730 |
+
"data": {
|
| 731 |
+
"text/plain": [
|
| 732 |
+
"<All keys matched successfully>"
|
| 733 |
+
]
|
| 734 |
+
},
|
| 735 |
+
"execution_count": 15,
|
| 736 |
+
"metadata": {},
|
| 737 |
+
"output_type": "execute_result"
|
| 738 |
+
}
|
| 739 |
+
],
|
| 740 |
+
"source": [
|
| 741 |
+
"state_dict = torch.load(\"/data/enformer/torch_weights/mouse.h5\")\n",
|
| 742 |
+
"model.model.load_state_dict(state_dict)"
|
| 743 |
+
]
|
| 744 |
+
},
|
| 745 |
+
{
|
| 746 |
+
"cell_type": "markdown",
|
| 747 |
+
"id": "9b11fb1f-2d53-4b08-ae6e-15c3dfa1c1b1",
|
| 748 |
+
"metadata": {},
|
| 749 |
+
"source": [
|
| 750 |
+
"## Add hparams"
|
| 751 |
+
]
|
| 752 |
+
},
|
| 753 |
+
{
|
| 754 |
+
"cell_type": "code",
|
| 755 |
+
"execution_count": 16,
|
| 756 |
+
"id": "d9a32cf3-4c69-45d9-81b9-f7c5b83ae763",
|
| 757 |
+
"metadata": {},
|
| 758 |
+
"outputs": [],
|
| 759 |
+
"source": [
|
| 760 |
+
"model.data_params[\"train\"] = dict()\n",
|
| 761 |
+
"model.data_params[\"val\"] = dict()\n",
|
| 762 |
+
"model.data_params[\"test\"] = dict()"
|
| 763 |
+
]
|
| 764 |
+
},
|
| 765 |
+
{
|
| 766 |
+
"cell_type": "code",
|
| 767 |
+
"execution_count": 17,
|
| 768 |
+
"id": "09cbfeb5-9f73-44f8-81c5-7f5db0ea577e",
|
| 769 |
+
"metadata": {},
|
| 770 |
+
"outputs": [],
|
| 771 |
+
"source": [
|
| 772 |
+
"model.data_params[\"train\"][\"seq_len\"] = 196608\n",
|
| 773 |
+
"model.data_params[\"train\"][\"label_len\"] = 896 * 128\n",
|
| 774 |
+
"model.data_params[\"train\"][\"genome\"] = \"mm10\"\n",
|
| 775 |
+
"model.data_params[\"train\"][\"bin_size\"] = 128\n",
|
| 776 |
+
"model.data_params[\"train\"][\"max_seq_shift\"] = 3\n",
|
| 777 |
+
"model.data_params[\"train\"][\"rc\"] = True"
|
| 778 |
+
]
|
| 779 |
+
},
|
| 780 |
+
{
|
| 781 |
+
"cell_type": "markdown",
|
| 782 |
+
"id": "208f6727-0540-43dd-8554-4036b8e57180",
|
| 783 |
+
"metadata": {},
|
| 784 |
+
"source": [
|
| 785 |
+
"## Add tasks"
|
| 786 |
+
]
|
| 787 |
+
},
|
| 788 |
+
{
|
| 789 |
+
"cell_type": "code",
|
| 790 |
+
"execution_count": 18,
|
| 791 |
+
"id": "ba516087-edd9-46b8-9e76-44ff35d6e553",
|
| 792 |
+
"metadata": {},
|
| 793 |
+
"outputs": [],
|
| 794 |
+
"source": [
|
| 795 |
+
"model.data_params[\"tasks\"] = tasks"
|
| 796 |
+
]
|
| 797 |
+
},
|
| 798 |
+
{
|
| 799 |
+
"cell_type": "markdown",
|
| 800 |
+
"id": "ab5ef3dd-aaa4-4322-a25f-d6b818efb893",
|
| 801 |
+
"metadata": {},
|
| 802 |
+
"source": [
|
| 803 |
+
"## Add intervals"
|
| 804 |
+
]
|
| 805 |
+
},
|
| 806 |
+
{
|
| 807 |
+
"cell_type": "code",
|
| 808 |
+
"execution_count": 19,
|
| 809 |
+
"id": "be3564f3-2b4d-4618-9277-faceab43b7cd",
|
| 810 |
+
"metadata": {},
|
| 811 |
+
"outputs": [],
|
| 812 |
+
"source": [
|
| 813 |
+
"model.data_params[\"train\"][\"intervals\"] = train_intervals.to_dict(orient='list')\n",
|
| 814 |
+
"model.data_params[\"val\"][\"intervals\"] = val_intervals.to_dict(orient='list')\n",
|
| 815 |
+
"model.data_params[\"test\"][\"intervals\"] = test_intervals.to_dict(orient='list')"
|
| 816 |
+
]
|
| 817 |
+
},
|
| 818 |
+
{
|
| 819 |
+
"cell_type": "markdown",
|
| 820 |
+
"id": "d8471c54-8716-41de-84b0-212ba7419adb",
|
| 821 |
+
"metadata": {},
|
| 822 |
+
"source": [
|
| 823 |
+
"## Save"
|
| 824 |
+
]
|
| 825 |
+
},
|
| 826 |
+
{
|
| 827 |
+
"cell_type": "code",
|
| 828 |
+
"execution_count": 20,
|
| 829 |
+
"id": "7c4e3973-40e4-46c7-9ac9-8985d82498e1",
|
| 830 |
+
"metadata": {},
|
| 831 |
+
"outputs": [
|
| 832 |
+
{
|
| 833 |
+
"name": "stderr",
|
| 834 |
+
"output_type": "stream",
|
| 835 |
+
"text": [
|
| 836 |
+
"Trainer will use only 1 of 8 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=8)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.\n",
|
| 837 |
+
"GPU available: True (cuda), used: True\n",
|
| 838 |
+
"TPU available: False, using: 0 TPU cores\n",
|
| 839 |
+
"HPU available: False, using: 0 HPUs\n",
|
| 840 |
+
"You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision\n",
|
| 841 |
+
"LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]\n"
|
| 842 |
+
]
|
| 843 |
+
}
|
| 844 |
+
],
|
| 845 |
+
"source": [
|
| 846 |
+
"trainer = pl.Trainer()\n",
|
| 847 |
+
"try:\n",
|
| 848 |
+
" trainer.predict(model) \n",
|
| 849 |
+
"except Exception:\n",
|
| 850 |
+
" trainer.save_checkpoint('/data/enformer/torch_weights/mouse.ckpt')"
|
| 851 |
+
]
|
| 852 |
+
},
|
| 853 |
+
{
|
| 854 |
+
"cell_type": "markdown",
|
| 855 |
+
"id": "5dd274ed-ba38-41ab-8288-ca848c33b539",
|
| 856 |
+
"metadata": {},
|
| 857 |
+
"source": [
|
| 858 |
+
"## Upload"
|
| 859 |
+
]
|
| 860 |
+
},
|
| 861 |
+
{
|
| 862 |
+
"cell_type": "code",
|
| 863 |
+
"execution_count": 21,
|
| 864 |
+
"id": "92ff666a-ac2a-41b4-a48f-dc2faeb1a3dd",
|
| 865 |
+
"metadata": {},
|
| 866 |
+
"outputs": [
|
| 867 |
+
{
|
| 868 |
+
"name": "stderr",
|
| 869 |
+
"output_type": "stream",
|
| 870 |
+
"text": [
|
| 871 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n",
|
| 872 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n",
|
| 873 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n",
|
| 874 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n",
|
| 875 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n",
|
| 876 |
+
"\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[33mWARNING\u001b[0m Serializing object of type list that is 246488 bytes\n"
|
| 877 |
+
]
|
| 878 |
+
},
|
| 879 |
+
{
|
| 880 |
+
"data": {
|
| 881 |
+
"text/plain": [
|
| 882 |
+
"<Artifact mouse>"
|
| 883 |
+
]
|
| 884 |
+
},
|
| 885 |
+
"execution_count": 21,
|
| 886 |
+
"metadata": {},
|
| 887 |
+
"output_type": "execute_result"
|
| 888 |
+
}
|
| 889 |
+
],
|
| 890 |
+
"source": [
|
| 891 |
+
"artifact = wandb.Artifact(\n",
|
| 892 |
+
" 'mouse', \n",
|
| 893 |
+
" type='model',\n",
|
| 894 |
+
" metadata={\n",
|
| 895 |
+
" 'model_params':model.model_params, \n",
|
| 896 |
+
" 'train_params':model.train_params, \n",
|
| 897 |
+
" 'data_params':model.data_params\n",
|
| 898 |
+
" }\n",
|
| 899 |
+
")\n",
|
| 900 |
+
"artifact.add_file(local_path='/data/enformer/torch_weights/mouse.ckpt', name='model.ckpt')\n",
|
| 901 |
+
"run.log_artifact(artifact)"
|
| 902 |
+
]
|
| 903 |
+
},
|
| 904 |
+
{
|
| 905 |
+
"cell_type": "code",
|
| 906 |
+
"execution_count": 22,
|
| 907 |
+
"id": "9269df1a-b6d4-410d-bbe6-39a4b70576cc",
|
| 908 |
+
"metadata": {},
|
| 909 |
+
"outputs": [
|
| 910 |
+
{
|
| 911 |
+
"data": {
|
| 912 |
+
"text/html": [],
|
| 913 |
+
"text/plain": [
|
| 914 |
+
"<IPython.core.display.HTML object>"
|
| 915 |
+
]
|
| 916 |
+
},
|
| 917 |
+
"metadata": {},
|
| 918 |
+
"output_type": "display_data"
|
| 919 |
+
},
|
| 920 |
+
{
|
| 921 |
+
"data": {
|
| 922 |
+
"text/html": [
|
| 923 |
+
" View run <strong style=\"color:#cdcd00\">copy-mouse</strong> at: <a href='https://wandb.ai/grelu/enformer/runs/jrgxfvad' target=\"_blank\">https://wandb.ai/grelu/enformer/runs/jrgxfvad</a><br> View project at: <a href='https://wandb.ai/grelu/enformer' target=\"_blank\">https://wandb.ai/grelu/enformer</a><br>Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)"
|
| 924 |
+
],
|
| 925 |
+
"text/plain": [
|
| 926 |
+
"<IPython.core.display.HTML object>"
|
| 927 |
+
]
|
| 928 |
+
},
|
| 929 |
+
"metadata": {},
|
| 930 |
+
"output_type": "display_data"
|
| 931 |
+
},
|
| 932 |
+
{
|
| 933 |
+
"data": {
|
| 934 |
+
"text/html": [
|
| 935 |
+
"Find logs at: <code>./wandb/run-20250304_222920-jrgxfvad/logs</code>"
|
| 936 |
+
],
|
| 937 |
+
"text/plain": [
|
| 938 |
+
"<IPython.core.display.HTML object>"
|
| 939 |
+
]
|
| 940 |
+
},
|
| 941 |
+
"metadata": {},
|
| 942 |
+
"output_type": "display_data"
|
| 943 |
+
}
|
| 944 |
+
],
|
| 945 |
+
"source": [
|
| 946 |
+
"run.finish() "
|
| 947 |
+
]
|
| 948 |
+
},
|
| 949 |
+
{
|
| 950 |
+
"cell_type": "code",
|
| 951 |
+
"execution_count": null,
|
| 952 |
+
"id": "b00d26d5-e7b6-47c1-a8b5-7d232d0c2592",
|
| 953 |
+
"metadata": {},
|
| 954 |
+
"outputs": [],
|
| 955 |
+
"source": []
|
| 956 |
+
}
|
| 957 |
+
],
|
| 958 |
+
"metadata": {
|
| 959 |
+
"kernelspec": {
|
| 960 |
+
"display_name": "Python 3 (ipykernel)",
|
| 961 |
+
"language": "python",
|
| 962 |
+
"name": "python3"
|
| 963 |
+
},
|
| 964 |
+
"language_info": {
|
| 965 |
+
"codemirror_mode": {
|
| 966 |
+
"name": "ipython",
|
| 967 |
+
"version": 3
|
| 968 |
+
},
|
| 969 |
+
"file_extension": ".py",
|
| 970 |
+
"mimetype": "text/x-python",
|
| 971 |
+
"name": "python",
|
| 972 |
+
"nbconvert_exporter": "python",
|
| 973 |
+
"pygments_lexer": "ipython3",
|
| 974 |
+
"version": "3.11.9"
|
| 975 |
+
}
|
| 976 |
+
},
|
| 977 |
+
"nbformat": 4,
|
| 978 |
+
"nbformat_minor": 5
|
| 979 |
+
}
|
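The interval-processing step in the notebook above widens each 131,072 bp interval to the 196,608 bp input length, and the saved hyperparameters set `label_len` to 896 * 128. A minimal sketch of that arithmetic follows. It assumes the resize is symmetric about the interval midpoint (which matches the before and after tables in the notebook output) and that `crop_len` counts bins trimmed from each side of the output. The helpers `resize_about_center` and `n_output_bins` are hypothetical names used for illustration; they are not gReLU's own `resize` implementation.

```python
# Hypothetical helpers for illustration only; not part of gReLU.

def resize_about_center(start, end, new_len):
    """Return (start, end) of length new_len sharing the input's midpoint."""
    center = (start + end) // 2
    new_start = center - new_len // 2
    return new_start, new_start + new_len


def n_output_bins(seq_len, bin_size, crop_len):
    """Bins remaining after trimming crop_len bins from each edge (assumed meaning)."""
    return seq_len // bin_size - 2 * crop_len


# First and last rows of the intervals table shown in the notebook output.
print(resize_about_center(34106647, 34237719, 196608))  # chr4 row
print(resize_about_center(6592346, 6723418, 196608))    # chr15 row

# Values taken from model_params and data_params in the notebook.
print(n_output_bins(seq_len=196608, bin_size=128, crop_len=320))
```

Under these assumptions, the resized coordinates agree with the rows printed after `resize(intervals, 196608)`, and the bin count times the bin size agrees with the `label_len` stored in `data_params`.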