---
license: mit
tags:
  - transformer
  - methylation
  - epigenomics
  - pretraining
  - masked-regression
datasets:
  - custom
language: en
library_name: transformers
model_name: MethFormer
pipeline_tag: feature-extraction
---

# 🧚 MethFormer: A Transformer for DNA Methylation

**MethFormer** is a masked-regression transformer trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.

---

## 🚀 Overview

* **Inputs**: Binned methylation values (5mC, 5hmC) over 1024bp windows (32 bins × 2 channels)
* **Pretraining objective**: Masked methylation imputation (per-bin regression)
* **Architecture**: Transformer encoder with linear projection head
* **Downstream tasks**: MLL binding prediction, chromatin state inference, or enhancer classification
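
The input/output shapes above can be sketched with a toy encoder. This is a hedged illustration, not the released architecture: `TinyMethEncoder` is a hypothetical name, and every dimension besides the 32 × 2 input is an assumption.

```python
import torch
import torch.nn as nn

class TinyMethEncoder(nn.Module):
    """Toy sketch: transformer encoder + linear per-bin regression head."""

    def __init__(self, n_channels=2, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)  # project 2 channels per bin
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_channels)   # per-bin regression output

    def forward(self, input_values):                 # (batch, 32 bins, 2 channels)
        h = self.encoder(self.embed(input_values))
        return self.head(h)                          # same shape as the input

model = TinyMethEncoder()
x = torch.rand(4, 32, 2)        # batch of 4 windows, 32 bins x (5mC, 5hmC)
per_bin_pred = model(x)         # (4, 32, 2): one prediction per bin per channel
```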

---

## 📁 Project Structure

```
.
├── config/                       # Configuration files
├── data/                         # Binned methylation datasets (HuggingFace format)
├── output/                       # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py             # Model classes and data collator
│   ├── pretrain_methformer.py    # Main training script
│   └── finetune_mll.py           # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```

---

## 👩‍💻 Pretraining MethFormer

### Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024bp windows, binned into 32 bins × 2 features. Save using Hugging Face's `datasets.DatasetDict` format:

```
DatasetDict({
  train: Dataset({
    features: ['input_values', 'attention_mask', 'labels']
  }),
  validation: Dataset(...)
})
```

### Step 2: Run Pretraining

```bash
python scripts/pretrain_methformer.py
```

Hyperparameters can be adjusted inside the script or adapted for sweep tuning. The script will:

* Train the model using masked regression loss
* Evaluate on a held-out chromosome (e.g., `chr8`)
* Log metrics to [Weights & Biases](https://wandb.ai)
* Save the best model checkpoint
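
The masked-regression objective can be sketched as follows. This is a minimal illustration; the mask ratio, sentinel value, and helper names are assumptions, not the script's actual settings.

```python
import torch

torch.manual_seed(0)

def mask_batch(input_values, mask_ratio=0.15, sentinel=-1.0):
    """Hide a random subset of bins by overwriting both channels with a sentinel."""
    mask = torch.rand(input_values.shape[:2]) < mask_ratio   # (batch, bins) bool
    masked = input_values.clone()
    masked[mask] = sentinel
    return masked, mask

def masked_mse(pred, labels, mask):
    """MSE averaged over channels, then over the masked bins only."""
    per_bin = ((pred - labels) ** 2).mean(dim=-1)            # (batch, bins)
    return per_bin[mask].mean()

labels = torch.rand(4, 32, 2)               # clean methylation targets
masked_inputs, mask = mask_batch(labels)    # model sees masked_inputs
loss = masked_mse(torch.zeros_like(labels), labels, mask)
```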

---

## 📊 Metrics

* `masked_mse`: Mean squared error over masked positions
* `masked_mae`: Mean absolute error over masked positions
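
In NumPy terms, both metrics restrict the error to the positions hidden during pretraining. This is a hedged sketch with a hand-picked toy example; `masked_metrics` is a hypothetical helper, not the script's function.

```python
import numpy as np

def masked_metrics(preds, labels, mask):
    """Compute MSE and MAE only where mask is True (the masked positions)."""
    err = preds[mask] - labels[mask]
    return {"masked_mse": float(np.mean(err ** 2)),
            "masked_mae": float(np.mean(np.abs(err)))}

preds = np.array([[0.2, 0.5], [0.9, 0.1]])
labels = np.array([[0.0, 0.5], [0.5, 0.1]])
mask = np.array([[True, False], [True, False]])   # only column 0 was masked
m = masked_metrics(preds, labels, mask)
# masked errors are 0.2 and 0.4 -> masked_mse = 0.1, masked_mae = 0.3
```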

---

## 🧪 Fine-tuning on MLL Binding

After pretraining:

1. Replace the regression head with a scalar head for MLL prediction.
2. Use a `Trainer` to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1kb regions.

See `scripts/finetune_mll.py` for an example.
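
Step 1 can be sketched as wrapping the pretrained body in a new module. Everything here is hedged: `MLLRegressionHead` is a hypothetical name, and a plain linear layer stands in for the pretrained MethFormer encoder.

```python
import torch
import torch.nn as nn

class MLLRegressionHead(nn.Module):
    """Pool per-bin features and predict one scalar (log1p MLL RPKM) per window."""

    def __init__(self, encoder, d_model=64):
        super().__init__()
        self.encoder = encoder                 # pretrained transformer body
        self.head = nn.Linear(d_model, 1)      # new scalar head

    def forward(self, input_values):
        h = self.encoder(input_values)         # (batch, bins, d_model)
        pooled = h.mean(dim=1)                 # mean-pool over bins
        return self.head(pooled).squeeze(-1)   # (batch,) scalar predictions

encoder = nn.Linear(2, 64)                     # stand-in for the pretrained body
model = MLLRegressionHead(encoder)
out = model(torch.rand(4, 32, 2))              # one prediction per window
```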

---

## 🔍 Visualizations & Interpretability

You can apply [Captum](https://captum.ai) or SHAP for:

* Per-bin attribution of 5mC/5hmC to MLL binding
* Visualizing what MethFormer attends to during fine-tuning
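
As a dependency-light stand-in for Captum's `IntegratedGradients`, a plain input-gradient attribution for per-bin importance looks like this (the tiny linear model is illustrative only, not the fine-tuned MethFormer):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative scalar-output model: flatten 32 bins x 2 channels -> one prediction.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 2, 1))

x = torch.rand(1, 32, 2, requires_grad=True)   # one methylation window
model(x).sum().backward()                      # gradient of prediction w.r.t. inputs
attributions = x.grad.squeeze(0)               # (32, 2): per-bin, per-channel scores
top_bin = attributions.abs().sum(dim=-1).argmax().item()  # most influential bin
```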

---

## 🛠️ Dependencies

Key packages:

* `transformers`
* `datasets`
* `wandb`
* `torch`
* `anndata`
* `scikit-learn`

---

## 🧠 Acknowledgements

* Built with inspiration from DNABERT, Grover, and vision transformers