
🧚 MethFormer: A Transformer for DNA Methylation

MethFormer is a masked regression transformer model trained to learn local and long-range patterns in DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.


🚀 Overview

  • Inputs: Binned methylation values (5mC, 5hmC) over 1024bp windows (32 bins × 2 channels)
  • Pretraining objective: Masked methylation imputation (per-bin regression)
  • Architecture: Transformer encoder with linear projection head
  • Downstream tasks: MLL binding prediction, chromatin state inference, or enhancer classification
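
The input layout above (32 bins × 2 channels per 1024bp window) can be sketched in plain Python. This is a minimal illustration, not the project's preprocessing code: it assumes equal-width 32bp bins and mean aggregation within each bin, neither of which is stated here, and the function name `bin_window` and the per-base input lists are hypothetical.

```python
from statistics import fmean

WINDOW_BP = 1024               # window length from the overview
N_BINS = 32                    # bins per window
BIN_BP = WINDOW_BP // N_BINS   # 32 bp per bin (assumes equal-width bins)


def bin_window(mc, hmc):
    """Average per-base 5mC and 5hmC values into a 32 x 2 matrix.

    `mc` and `hmc` are hypothetical per-base lists of length 1024.
    Mean aggregation is an assumption made for this sketch.
    """
    if len(mc) != WINDOW_BP or len(hmc) != WINDOW_BP:
        raise ValueError("expected 1024 values per channel")
    rows = []
    for b in range(N_BINS):
        start, end = b * BIN_BP, (b + 1) * BIN_BP
        # One row per bin: [mean 5mC, mean 5hmC]
        rows.append([fmean(mc[start:end]), fmean(hmc[start:end])])
    return rows


window = bin_window([0.5] * WINDOW_BP, [0.1] * WINDOW_BP)
print(len(window), len(window[0]))  # 32 2
```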

πŸ“ Project Structure

.
├── config/                       # config
├── data/                         # Binned methylation datasets (HuggingFace format)
├── output/                       # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py             # Model classes, data collator
│   ├── pretrain_methformer.py    # Main training script
│   └── finetune_mll.py           # (optional) downstream fine-tuning
├── requirements.txt
└── README.md

πŸ‘©β€πŸ’» Pretraining MethFormer

Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024bp windows, binned into 32 bins × 2 features. Save using Hugging Face's datasets.DatasetDict format:

DatasetDict({
  train: Dataset({
    features: ['input_values', 'attention_mask', 'labels']
  }),
  validation: Dataset(...)
})
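
As a rough sketch of what one record with these three features might look like, the snippet below wraps a binned window in a plain dictionary. The meaning given to `attention_mask` (1 for a usable bin) and the choice to start `labels` as a copy of the inputs are assumptions for illustration; the helper `make_record` is hypothetical and the actual fields are produced by the project's own preprocessing.

```python
N_BINS, N_CHANNELS = 32, 2


def make_record(binned):
    """Wrap one binned window (32 rows of [5mC, 5hmC]) as a training record."""
    if len(binned) != N_BINS or any(len(row) != N_CHANNELS for row in binned):
        raise ValueError("expected a 32 x 2 window")
    return {
        "input_values": [list(row) for row in binned],
        # Assumption: 1 marks a bin with usable data, 0 a bin to ignore.
        "attention_mask": [1] * N_BINS,
        # Assumption: labels start as a copy of the unmasked inputs.
        "labels": [list(row) for row in binned],
    }


record = make_record([[0.8, 0.05]] * N_BINS)
print(sorted(record))  # ['attention_mask', 'input_values', 'labels']
```

With the datasets library installed, lists of such records could be wrapped with `Dataset.from_list` and grouped into a `DatasetDict` with `train` and `validation` keys.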

Step 2: Run Pretraining

python scripts/pretrain_methformer.py

Options can be set inside the script or adapted for hyperparameter sweeps. Running the script will:

  • Train the model using masked regression loss
  • Evaluate on a held-out chromosome (e.g., chr8)
  • Log metrics to Weights & Biases
  • Save the best model checkpoint
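
Holding out a whole chromosome keeps validation windows from sitting next to training windows on the genome. A minimal sketch of such a split is shown below; the `chrom` field and the function `split_by_chromosome` are hypothetical names, and the training script may implement the split differently.

```python
def split_by_chromosome(records, held_out="chr8"):
    """Send every window on `held_out` to validation and the rest to train."""
    splits = {"train": [], "validation": []}
    for rec in records:
        key = "validation" if rec["chrom"] == held_out else "train"
        splits[key].append(rec)
    return splits


# Toy windows; the `chrom` and `start` fields are illustrative only.
windows = [
    {"chrom": "chr1", "start": 0},
    {"chrom": "chr8", "start": 1024},
    {"chrom": "chr8", "start": 2048},
    {"chrom": "chrX", "start": 0},
]
splits = split_by_chromosome(windows)
print(len(splits["train"]), len(splits["validation"]))  # 2 2
```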

📊 Metrics

  • masked_mse: Mean squared error over unmasked positions
  • masked_mae: Mean absolute error over the same positions
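
Both metrics restrict the error to a subset of positions. The sketch below shows that computation with an explicit boolean selection mask; which positions the mask should select is left to the caller, since that depends on how the training script defines it. The function name `masked_errors` is hypothetical.

```python
def masked_errors(pred, target, mask):
    """Return (MSE, MAE) over the positions where `mask` is True."""
    diffs = [p - t for p, t, m in zip(pred, target, mask) if m]
    if not diffs:
        raise ValueError("mask selects no positions")
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


pred = [0.5, 0.25, 1.0, 0.0]
target = [0.0, 0.75, 1.0, 1.0]
mask = [True, True, False, False]  # only the first two positions are scored
mse, mae = masked_errors(pred, target, mask)
print(mse, mae)  # 0.25 0.5
```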

🧪 Fine-tuning on MLL Binding

After pretraining:

  1. Replace the regression head with a scalar head for MLL prediction.
  2. Use a Trainer to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1 kb regions.

See scripts/finetune_mll.py for an example.
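
For orientation, the regression target described in step 2 could be computed roughly as follows. This sketch assumes the RPKM values are averaged across the region first and then log1p-transformed; the opposite order is also a possible reading, and the function name `mll_target` is hypothetical.

```python
import math
from statistics import fmean


def mll_target(rpkm_values):
    """log1p of the mean RPKM across one region (assumed order: mean, then log1p)."""
    return math.log1p(fmean(rpkm_values))


print(mll_target([0.0, 0.0]))  # 0.0
```

The log1p transform keeps regions with zero signal at exactly zero while compressing large RPKM values.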


πŸ” Visualizations & Interpretability

You can run Captum or SHAP for:

  • Per-bin attribution of 5mC/5hmC to MLL binding
  • Visualizing what MethFormer attends to during fine-tuning
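
To make the idea of per-bin attribution concrete without a trained model, the sketch below uses plain occlusion: each (bin, channel) value is replaced by a baseline and the change in the prediction is recorded. This is a simpler stand-in technique, not the Captum or SHAP workflow itself; `occlusion_attribution`, the zero baseline, and `toy_predict` are all assumptions made for illustration.

```python
def occlusion_attribution(predict, window, baseline=0.0):
    """Score each (bin, channel) by the prediction drop when it is set to `baseline`."""
    reference = predict(window)
    scores = []
    for b, row in enumerate(window):
        row_scores = []
        for c in range(len(row)):
            occluded = [list(r) for r in window]  # copy so the input is untouched
            occluded[b][c] = baseline
            row_scores.append(reference - predict(occluded))
        scores.append(row_scores)
    return scores


def toy_predict(window):
    """Toy stand-in for a fine-tuned model: only 5hmC in bin 0 matters."""
    return 2.0 * window[0][1]


window = [[0.5, 0.5] for _ in range(32)]
scores = occlusion_attribution(toy_predict, window)
print(scores[0])  # [0.0, 1.0]
```

In this toy case only the 5hmC channel of bin 0 receives a non-zero score, which is what the stand-in model was built to depend on.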

πŸ› οΈ Dependencies

Key packages:

  • transformers
  • datasets
  • wandb
  • torch
  • anndata
  • scikit-learn

🧠 Acknowledgements

  • Built with inspiration from DNABERT, Grover, and vision transformers