---
license: mit
tags:
- transformer
- methylation
- epigenomics
- pretraining
- masked-regression
datasets:
- custom
language: en
library_name: transformers
model_name: MethFormer
pipeline_tag: feature-extraction
---
# 🧬 MethFormer: A Transformer for DNA Methylation
MethFormer is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.
## 📋 Overview
- Inputs: Binned methylation values (5mC, 5hmC) over 1024 bp windows (32 bins × 2 channels)
- Pretraining objective: Masked methylation imputation (per-bin regression)
- Architecture: Transformer encoder with linear projection head
- Downstream tasks: MLL binding prediction, chromatin state inference, or enhancer classification
## 📁 Project Structure

```
.
├── config/                     # Configuration files
├── data/                       # Binned methylation datasets (Hugging Face format)
├── output/                     # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py           # Model classes and data collator
│   ├── pretrain_methformer.py  # Main training script
│   └── finetune_mll.py         # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```
## 👩‍💻 Pretraining MethFormer

### Step 1: Prepare Dataset
Preprocess 5mC and 5hmC data into 1024 bp windows, binned into 32 bins × 2 features. Save using Hugging Face's `datasets.DatasetDict` format:
```
DatasetDict({
    train: Dataset({
        features: ['input_values', 'attention_mask', 'labels']
    }),
    validation: Dataset(...)
})
```
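A pure-Python sketch of this preprocessing step, assuming per-bp methylation values are averaged within each bin (the averaging scheme and the exact field contents are illustrative assumptions, not taken from the repository):

```python
# Hypothetical preprocessing: bin per-bp 5mC/5hmC values into model inputs.
# Window size and bin count come from the Overview section above.
WINDOW = 1024
N_BINS = 32
BIN_SIZE = WINDOW // N_BINS  # 32 bp per bin

def bin_window(mc, hmc):
    """Average per-bp 5mC/5hmC values into 32 bins x 2 channels."""
    assert len(mc) == WINDOW and len(hmc) == WINDOW
    input_values = []
    for b in range(N_BINS):
        chunk = slice(b * BIN_SIZE, (b + 1) * BIN_SIZE)
        mc_mean = sum(mc[chunk]) / BIN_SIZE
        hmc_mean = sum(hmc[chunk]) / BIN_SIZE
        input_values.append([mc_mean, hmc_mean])
    return {
        "input_values": input_values,    # 32 bins x 2 channels
        "attention_mask": [1] * N_BINS,  # all bins observed
        "labels": input_values,          # regression targets for imputation
    }

example = bin_window([0.5] * WINDOW, [0.1] * WINDOW)
```

Records produced this way can then be assembled into a `datasets.Dataset` (e.g. via `Dataset.from_list`) and grouped into the `DatasetDict` shown above.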
### Step 2: Run Pretraining

```bash
python scripts/pretrain_methformer.py
```
Options can be customized inside the script or overridden for hyperparameter sweeps. This will:
- Train the model using a masked regression loss
- Evaluate on a held-out chromosome (e.g., `chr8`)
- Log metrics to Weights & Biases
- Save the best model checkpoint
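The masked-regression objective can be sketched as follows; the mask fraction and the sentinel value written into masked bins are illustrative assumptions, not taken from the repository:

```python
# Sketch of BERT-style masking for per-bin regression: hide a random
# subset of bins and train the model to reconstruct them.
import random

MASK_VALUE = -1.0  # assumed sentinel written into masked bins
MASK_FRAC = 0.15   # assumed fraction of bins to mask

def mask_bins(input_values, rng):
    """Return masked inputs plus a boolean mask per bin; the loss is
    computed only on the masked bins."""
    masked, mask = [], []
    for bin_vals in input_values:
        if rng.random() < MASK_FRAC:
            masked.append([MASK_VALUE] * len(bin_vals))
            mask.append(True)
        else:
            masked.append(list(bin_vals))
            mask.append(False)
    return masked, mask

rng = random.Random(0)
inputs = [[0.5, 0.1]] * 32
masked, mask = mask_bins(inputs, rng)
```

In practice this logic would live in the data collator (`scripts/methformer.py`), so each batch is re-masked on the fly.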
## 📊 Metrics

- `masked_mse`: Mean squared error over masked positions
- `masked_mae`: Mean absolute error over masked positions
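A minimal reference implementation of these metrics over flattened per-bin values (a sketch; the repository's actual implementation may operate on batched tensors instead):

```python
def masked_mse(preds, targets, mask):
    """Mean squared error restricted to masked positions."""
    errs = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(errs) / len(errs)

def masked_mae(preds, targets, mask):
    """Mean absolute error restricted to masked positions."""
    errs = [abs(p - t) for p, t, m in zip(preds, targets, mask) if m]
    return sum(errs) / len(errs)
```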
## 🧪 Fine-tuning on MLL Binding
After pretraining:
- Replace the regression head with a scalar head for MLL prediction.
- Use a `Trainer` to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1 kb regions.

See `scripts/finetune_mll.py` for an example.
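The target transform described above can be sketched as follows (the helper name `mll_target` is hypothetical):

```python
import math

def mll_target(rpkm_values):
    """Fine-tuning target: mean MLL-N RPKM over a 1 kb region,
    log1p-transformed to compress the dynamic range."""
    mean_rpkm = sum(rpkm_values) / len(rpkm_values)
    return math.log1p(mean_rpkm)
```

log1p keeps zero-coverage regions at exactly 0 while compressing the heavy right tail typical of RPKM distributions.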
## 📈 Visualizations & Interpretability
You can run Captum or SHAP for:
- Per-bin attribution of 5mC/5hmC to MLL binding
- Visualizing what MethFormer attends to during fine-tuning
## 🛠️ Dependencies
Key packages: `transformers`, `datasets`, `wandb`, `torch`, `anndata`, `scikit-learn`
## 🧠 Acknowledgements
- Built with inspiration from DNABERT, Grover, and vision transformers