---
license: mit
tags:
- transformer
- methylation
- epigenomics
- pretraining
- masked-regression
datasets:
- custom
language: en
library_name: transformers
model_name: MethFormer
pipeline_tag: feature-extraction
---

# 🧚 MethFormer: A Transformer for DNA Methylation

**MethFormer** is a masked-regression transformer trained to learn local and long-range patterns of DNA methylation (5mC and 5hmC) across genomic regions. Pretrained on binned methylation data, it is designed for downstream fine-tuning on tasks such as predicting MLL binding or chromatin state.

---

## 🚀 Overview

* **Inputs**: binned methylation values (5mC, 5hmC) over 1024 bp windows (32 bins × 2 channels)
* **Pretraining objective**: masked methylation imputation (per-bin regression)
* **Architecture**: Transformer encoder with a linear projection head
* **Downstream tasks**: MLL binding prediction, chromatin state inference, or enhancer classification

---

## 📁 Project Structure

```
.
├── config/                      # Config files
├── data/                        # Binned methylation datasets (Hugging Face format)
├── output/                      # Pretrained models, logs, and checkpoints
├── scripts/
│   ├── methformer.py            # Model classes and data collator
│   ├── pretrain_methformer.py   # Main training script
│   └── finetune_mll.py          # (optional) downstream fine-tuning
├── requirements.txt
└── README.md
```

---

## 👩‍💻 Pretraining MethFormer

### Step 1: Prepare Dataset

Preprocess 5mC and 5hmC data into 1024 bp windows, binned into 32 bins × 2 features. Save using Hugging Face's `datasets.DatasetDict` format:

```
DatasetDict({
    train: Dataset({
        features: ['input_values', 'attention_mask', 'labels']
    }),
    validation: Dataset(...)
})
```

### Step 2: Run Pretraining

```bash
python scripts/pretrain_methformer.py
```

Options can be customized inside the script or modified for sweep tuning.
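The pretraining objective — hide a subset of bins and regress their methylation values — can be sketched with a minimal data collator. This is an illustrative sketch, not the actual collator in `scripts/methformer.py`; the class name, mask fraction, and `loss_mask` key are assumptions:

```python
import torch

class MaskedMethylationCollator:
    """Mask a random fraction of bins; the model regresses the hidden values.

    Illustrative sketch only -- the real collator lives in scripts/methformer.py.
    """

    def __init__(self, mask_fraction: float = 0.15, mask_value: float = -1.0):
        self.mask_fraction = mask_fraction
        self.mask_value = mask_value

    def __call__(self, batch):
        # Stack per-window inputs into (batch, 32 bins, 2 channels) for 5mC/5hmC.
        inputs = torch.stack(
            [torch.as_tensor(ex["input_values"], dtype=torch.float32) for ex in batch]
        )
        labels = inputs.clone()

        # Pick bins to mask; both channels of a masked bin are hidden together.
        mask = torch.rand(inputs.shape[:2]) < self.mask_fraction
        inputs[mask] = self.mask_value

        # The regression loss is computed only where loss_mask is True.
        return {"input_values": inputs, "labels": labels, "loss_mask": mask}
```

With a collator like this, the masked MSE is simply `((preds - labels)[loss_mask] ** 2).mean()`, so unmasked bins contribute nothing to the loss.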
This will:

* Train the model with the masked-regression loss
* Evaluate on a held-out chromosome (e.g., `chr8`)
* Log metrics to [Weights & Biases](https://wandb.ai)
* Save the best model checkpoint

---

## 📊 Metrics

* `masked_mse`: mean squared error over the masked positions
* `masked_mae`: mean absolute error over the masked positions

---

## 🧪 Fine-tuning on MLL Binding

After pretraining:

1. Replace the regression head with a scalar head for MLL prediction.
2. Use a `Trainer` to fine-tune on log1p-transformed MLL-N RPKM values averaged over 1 kb regions.

See `scripts/finetune_mll.py` for an example.

---

## 🔍 Visualizations & Interpretability

You can run [Captum](https://captum.ai) or SHAP to explore:

* Per-bin attribution of 5mC/5hmC signal to predicted MLL binding
* What MethFormer attends to during fine-tuning

---

## 🛠️ Dependencies

Key packages:

* `transformers`
* `datasets`
* `wandb`
* `torch`
* `anndata`
* `scikit-learn`

---

## 🧠 Acknowledgements

* Built with inspiration from DNABERT, Grover, and vision transformers
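As a supplementary sketch of the head swap described under fine-tuning: wrap the pretrained encoder in a module that mean-pools the per-bin embeddings and regresses a single scalar per window. The wrapper below is hypothetical; the encoder interface, attribute names, and pooling choice are assumptions, not the actual `scripts/finetune_mll.py` code:

```python
import torch
import torch.nn as nn

class MethFormerForMLL(nn.Module):
    """Pretrained encoder + scalar head for MLL-N regression (sketch only).

    Assumes `encoder` maps (batch, n_bins, 2) inputs to per-bin embeddings
    of shape (batch, n_bins, hidden_size).
    """

    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 1)  # scalar output per window

    def forward(self, input_values, labels=None):
        hidden = self.encoder(input_values)        # (batch, bins, hidden)
        pooled = hidden.mean(dim=1)                # mean-pool over bins
        preds = self.head(pooled).squeeze(-1)      # (batch,)
        loss = None
        if labels is not None:
            # Regress log1p-transformed MLL-N RPKM averaged over the window.
            loss = nn.functional.mse_loss(preds, labels)
        # Returning a dict with "loss"/"logits" keeps it Trainer-compatible.
        return {"loss": loss, "logits": preds}
```

Because the module returns a `loss` when `labels` are provided, it can be dropped into a Hugging Face `Trainer` without a custom loss function.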