---
license: mit
tags:
- transformer
- methylation
- epigenomics
- pretraining
- masked-regression
datasets:
- custom
language: en
library_name: transformers
model_name: MethFormer
pipeline_tag: feature-extraction
---

# 🧠 MethFormer: A Transformer for DNA Methylation

**MethFormer** is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.

---

## 📌 Overview

* **Inputs**: Binned methylation values (5mC, 5hmC) over 1024 bp windows (32 bins × 2 channels; see the shape sketch after this list)
* **Pretraining objective**: Masked methylation imputation (per-bin regression)
* **Architecture**: Transformer encoder with a linear projection head
* **Downstream tasks**: MLL binding prediction, chromatin state inference, or enhancer classification
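
As a rough sketch of the tensors involved (field names follow the dataset schema in Step 1 below; dtypes and value ranges are assumptions):

```python
import torch

batch_size, n_bins, n_channels = 8, 32, 2

# 5mC/5hmC fractions per 32 bp bin, assumed to lie in [0, 1]
input_values = torch.rand(batch_size, n_bins, n_channels)

# 1 = bin observed, 0 = padding or missing coverage
attention_mask = torch.ones(batch_size, n_bins, dtype=torch.long)

# Pretraining targets are the original values; the collator hides a
# subset of bins and the model reconstructs them
labels = input_values.clone()
```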

---

## 📁 Project Structure

```
.
├── config/                      # Training and model configuration
├── data/                        # Binned methylation datasets (Hugging Face format)
├── output/                      # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py            # Model classes and data collator
│   ├── pretrain_methformer.py   # Main training script
│   └── finetune_mll.py          # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```

---

## 👩‍💻 Pretraining MethFormer

### Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024 bp windows, binned into 32 bins × 2 features. Save using Hugging Face's `datasets.DatasetDict` format:

```
DatasetDict({
    train: Dataset({
        features: ['input_values', 'attention_mask', 'labels']
    }),
    validation: Dataset(...)
})
```
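
A minimal sketch of building such a dataset (the random placeholder arrays and output path are assumptions; substitute your real preprocessing):

```python
import numpy as np
from datasets import Dataset, DatasetDict

# Placeholder windows: in practice, bin real 5mC/5hmC coverage into
# (32 bins x 2 marks) arrays, one per 1024 bp window
windows = np.random.rand(1000, 32, 2).astype("float32")

def to_columns(arr):
    return {
        "input_values": [w.tolist() for w in arr],
        "attention_mask": [[1] * 32 for _ in arr],
        "labels": [w.tolist() for w in arr],  # reconstruction targets
    }

ds = DatasetDict({
    "train": Dataset.from_dict(to_columns(windows[:900])),
    "validation": Dataset.from_dict(to_columns(windows[900:])),
})
ds.save_to_disk("data/methformer_pretrain")  # illustrative path
```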

### Step 2: Run Pretraining

```bash
python scripts/pretrain_methformer.py
```

Options can be customized inside the script or adapted for sweep tuning. This will:

* Train the model using a masked regression loss (sketched after this list)
* Evaluate on a held-out chromosome (e.g., `chr8`)
* Log metrics to [Weights & Biases](https://wandb.ai)
* Save the best model checkpoint
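
The actual masking logic lives in `scripts/methformer.py`; the sketch below only illustrates the masked-regression objective (the 15% mask rate and zero-fill corruption are assumptions):

```python
import torch

def mask_bins(input_values: torch.Tensor, mask_prob: float = 0.15):
    """Hide a random subset of bins; return the corrupted inputs and a
    boolean mask marking which bins were hidden."""
    masked = input_values.clone()
    # One decision per bin, shared across the 5mC/5hmC channels
    bin_mask = torch.rand(input_values.shape[:2]) < mask_prob
    masked[bin_mask] = 0.0
    return masked, bin_mask

def masked_regression_loss(predictions, labels, bin_mask):
    """MSE computed only over the bins that were hidden."""
    diff = (predictions - labels)[bin_mask]
    return (diff ** 2).mean()
```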

---

## 📊 Metrics

* `masked_mse`: Mean squared error over the masked positions (see the sketch below)
* `masked_mae`: Mean absolute error over the masked positions
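
A minimal sketch of how these could be computed (function name and argument layout are illustrative, not the training script's actual API):

```python
import numpy as np

def masked_metrics(predictions, labels, bin_mask):
    """Errors restricted to the bins hidden during masking."""
    p, l = predictions[bin_mask], labels[bin_mask]
    return {
        "masked_mse": float(np.mean((p - l) ** 2)),
        "masked_mae": float(np.mean(np.abs(p - l))),
    }
```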

---

## 🧪 Fine-tuning on MLL Binding

After pretraining:

1. Replace the regression head with a scalar head for MLL prediction.
2. Use a `Trainer` to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1 kb regions.

See `scripts/finetune_mll.py` for an example.
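
A minimal sketch of step 1, assuming the pretrained encoder returns per-bin hidden states of size `hidden_size` (the class name, pooling choice, and encoder call signature are illustrative; see `scripts/finetune_mll.py` for the real version):

```python
import torch.nn as nn

class MethFormerForMLL(nn.Module):
    def __init__(self, pretrained_encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = pretrained_encoder      # reused from pretraining
        self.head = nn.Linear(hidden_size, 1)  # new scalar MLL head

    def forward(self, input_values, attention_mask=None):
        # Assumed encoder output: hidden states of shape (B, 32, H)
        hidden = self.encoder(input_values, attention_mask)
        pooled = hidden.mean(dim=1)            # mean-pool over bins
        return self.head(pooled).squeeze(-1)   # one log1p(RPKM) per window
```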

---

## 📈 Visualizations & Interpretability

You can run [Captum](https://captum.ai) or SHAP for:

* Per-bin attribution of 5mC/5hmC to MLL binding (see the sketch below)
* Visualizing what MethFormer attends to during fine-tuning
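
For instance, integrated gradients with Captum might look like the sketch below (the stand-in model and all-zero "unmethylated" baseline are assumptions; swap in your fine-tuned checkpoint):

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in for a fine-tuned scalar-output model
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 2, 1))
model.eval()

# Wrap the forward pass so Captum sees one scalar per example
ig = IntegratedGradients(lambda x: model(x).squeeze(-1))

inputs = torch.rand(1, 32, 2)        # one 1024 bp window (32 bins x 2 marks)
baseline = torch.zeros_like(inputs)  # fully unmethylated reference

# Same shape as inputs: per-bin, per-mark contribution to the prediction
attributions = ig.attribute(inputs, baselines=baseline)
```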

---

## 🛠️ Dependencies

Key packages:

* `transformers`
* `datasets`
* `wandb`
* `torch`
* `anndata`
* `scikit-learn`

---

## 🧠 Acknowledgements

* Built with inspiration from DNABERT, Grover, and vision transformers