---
license: mit
tags:
- transformer
- methylation
- epigenomics
- pretraining
- masked-regression
datasets:
- custom
language: en
library_name: transformers
model_name: MethFormer
pipeline_tag: feature-extraction
---

# 🧠 MethFormer: A Transformer for DNA Methylation

**MethFormer** is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.

---

## 📌 Overview

* **Inputs**: Binned methylation values (5mC, 5hmC) over 1024 bp windows (32 bins × 2 channels; see the shape sketch after this list)
* **Pretraining objective**: Masked methylation imputation (per-bin regression)
* **Architecture**: Transformer encoder with a linear projection head
* **Downstream tasks**: MLL binding prediction, chromatin state inference, or enhancer classification
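
As a rough sketch of the tensors involved (field names follow the dataset schema in Step 1 below; dtypes and value ranges are assumptions):

```python
import torch

batch_size, n_bins, n_channels = 8, 32, 2

# 5mC/5hmC fractions per 32 bp bin, assumed to lie in [0, 1]
input_values = torch.rand(batch_size, n_bins, n_channels)

# 1 = bin observed, 0 = padding or missing coverage
attention_mask = torch.ones(batch_size, n_bins, dtype=torch.long)

# Pretraining targets are the original values; the collator hides a
# subset of bins and the model reconstructs them
labels = input_values.clone()
```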

---

## 📁 Project Structure

```
.
├── config/                      # Training and model configuration
├── data/                        # Binned methylation datasets (Hugging Face format)
├── output/                      # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py            # Model classes and data collator
│   ├── pretrain_methformer.py   # Main training script
│   └── finetune_mll.py          # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```

---

## 👩‍💻 Pretraining MethFormer

### Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024 bp windows, binned into 32 bins × 2 features. Save using Hugging Face's `datasets.DatasetDict` format:

```
DatasetDict({
    train: Dataset({
        features: ['input_values', 'attention_mask', 'labels']
    }),
    validation: Dataset(...)
})
```
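
A minimal sketch of building such a dataset (the random placeholder arrays and output path are assumptions; substitute your real preprocessing):

```python
import numpy as np
from datasets import Dataset, DatasetDict

# Placeholder windows: in practice, bin real 5mC/5hmC coverage into
# (32 bins x 2 marks) arrays, one per 1024 bp window
windows = np.random.rand(1000, 32, 2).astype("float32")

def to_columns(arr):
    return {
        "input_values": [w.tolist() for w in arr],
        "attention_mask": [[1] * 32 for _ in arr],
        "labels": [w.tolist() for w in arr],  # reconstruction targets
    }

ds = DatasetDict({
    "train": Dataset.from_dict(to_columns(windows[:900])),
    "validation": Dataset.from_dict(to_columns(windows[900:])),
})
ds.save_to_disk("data/methformer_pretrain")  # illustrative path
```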

### Step 2: Run Pretraining

```bash
python scripts/pretrain_methformer.py
```

Options can be customized inside the script or adapted for sweep tuning. This will:

* Train the model using a masked regression loss (sketched after this list)
* Evaluate on a held-out chromosome (e.g., `chr8`)
* Log metrics to [Weights & Biases](https://wandb.ai)
* Save the best model checkpoint
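
The actual masking logic lives in `scripts/methformer.py`; the sketch below only illustrates the masked-regression objective (the 15% mask rate and zero-fill corruption are assumptions):

```python
import torch

def mask_bins(input_values: torch.Tensor, mask_prob: float = 0.15):
    """Hide a random subset of bins; return the corrupted inputs and a
    boolean mask marking which bins were hidden."""
    masked = input_values.clone()
    # One decision per bin, shared across the 5mC/5hmC channels
    bin_mask = torch.rand(input_values.shape[:2]) < mask_prob
    masked[bin_mask] = 0.0
    return masked, bin_mask

def masked_regression_loss(predictions, labels, bin_mask):
    """MSE computed only over the bins that were hidden."""
    diff = (predictions - labels)[bin_mask]
    return (diff ** 2).mean()
```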

---

## 📊 Metrics

* `masked_mse`: Mean squared error over the masked positions (see the sketch below)
* `masked_mae`: Mean absolute error over the masked positions
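
A minimal sketch of how these could be computed (function name and argument layout are illustrative, not the training script's actual API):

```python
import numpy as np

def masked_metrics(predictions, labels, bin_mask):
    """Errors restricted to the bins hidden during masking."""
    p, l = predictions[bin_mask], labels[bin_mask]
    return {
        "masked_mse": float(np.mean((p - l) ** 2)),
        "masked_mae": float(np.mean(np.abs(p - l))),
    }
```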

---

## 🧪 Fine-tuning on MLL Binding

After pretraining:

1. Replace the regression head with a scalar head for MLL prediction.
2. Use a `Trainer` to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1 kb regions.

See `scripts/finetune_mll.py` for an example.
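
A minimal sketch of step 1, assuming the pretrained encoder returns per-bin hidden states of size `hidden_size` (the class name, pooling choice, and encoder call signature are illustrative; see `scripts/finetune_mll.py` for the real version):

```python
import torch.nn as nn

class MethFormerForMLL(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = pretrained_encoder      # reused from pretraining
        self.head = nn.Linear(hidden_size, 1)  # new scalar MLL head

    def forward(self, input_values, attention_mask=None):
        # Assumed encoder output: hidden states of shape (B, 32, H)
        hidden = self.encoder(input_values, attention_mask)
        pooled = hidden.mean(dim=1)            # mean-pool over bins
        return self.head(pooled).squeeze(-1)   # one log1p(RPKM) per window
```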

---

## 📈 Visualizations & Interpretability

You can run [Captum](https://captum.ai) or SHAP for:

* Per-bin attribution of 5mC/5hmC to MLL binding (see the sketch below)
* Visualizing what MethFormer attends to during fine-tuning
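
For instance, integrated gradients with Captum might look like the sketch below (the stand-in model and all-zero "unmethylated" baseline are assumptions; swap in your fine-tuned checkpoint):

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in for a fine-tuned scalar-output model
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 2, 1))
model.eval()

# Wrap the forward pass so Captum sees one scalar per example
ig = IntegratedGradients(lambda x: model(x).squeeze(-1))

inputs = torch.rand(1, 32, 2)        # one 1024 bp window (32 bins x 2 marks)
baseline = torch.zeros_like(inputs)  # fully unmethylated reference

# Same shape as inputs: per-bin, per-mark contribution to the prediction
attributions = ig.attribute(inputs, baselines=baseline)
```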

---

## 🛠️ Dependencies

Key packages:

* `transformers`
* `datasets`
* `wandb`
* `torch`
* `anndata`
* `scikit-learn`

---

## 🧠 Acknowledgements

* Built with inspiration from DNABERT, Grover, and vision transformers