---
license: cc-by-nc-4.0
tags:
- m42
- genomics
- biology
- GFM
- BioFM
- BioToken
---

# BioFM: A Biologically-Informed Genomic Foundation Model
BioFM is a genomic foundation model designed to address limitations of existing genomic sequence models. It introduces BioToken, a tokenization framework that encodes genomic variants and structural annotations alongside the DNA sequence, giving the model explicit biological context for representation learning.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/657159a2f7ce9db30dcee3fd/8sgpnHyNz4_sbQ7ZkzB6h.png)

## Model Highlights
- With BioToken, BioFM achieves competitive genomic prediction results using only 265 million parameters, which reduces computational requirements and training costs.
- BioFM matches or outperforms specialized models on their own tasks: Enformer on expression prediction and SpliceTransformer on sQTL prediction.
- BioFM outperforms existing GFMs on genomic tasks that require long-range genomic context, such as expression prediction, coding/non-coding pathogenicity prediction, and sQTL prediction.

## Model Details
- **Model developers:** M42 Health AI Team
- **Base architecture:** [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM)
- **Context length:**
  - **Training:** 6k tokens
  - **Inference:** 12k tokens
- **Training data:** 1000 Genomes
- **Input format:** Annotated DNA sequences using BioToken
- **Output options:**
  - DNA sequences only
  - Embeddings
- **License:** CC BY-NC 4.0
- **Publication:** [Paper link](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711)
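
Inputs longer than the inference context need to be split before they reach the model. The following is a minimal sketch of one way to do that, not part of BioFM-Eval: `split_into_windows` is a hypothetical helper, "12k" is assumed to mean 12,000 tokens, and the input is assumed to be already tokenized (how many BioTokens a given stretch of DNA produces is not specified in this card).

```python
from typing import List, Sequence


def split_into_windows(
    tokens: Sequence[str],
    max_tokens: int = 12_000,  # assumption: "12k tokens" read as 12,000
    overlap: int = 0,
) -> List[List[str]]:
    """Split a token sequence into windows of at most `max_tokens` items.

    Consecutive windows share `overlap` tokens so that context near a
    window boundary is seen by both neighbours.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the input.
        if start + max_tokens >= len(tokens):
            break
    return windows


# Toy example: 20 tokens, windows of 8 tokens overlapping by 2.
toy_tokens = list("ACGT" * 5)
toy_windows = split_into_windows(toy_tokens, max_tokens=8, overlap=2)
print([len(w) for w in toy_windows])
```

The overlap is optional; it only matters when a feature of interest could fall on a window boundary.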

## Model Inference
We developed the BioFM-Eval Python package for inference and embedding extraction from genomic sequences. Refer to the [BioFM-Eval](https://github.com/m42-health/biofm-eval/) library for setup and installation instructions.

### Creating Variant Embeddings with BioFM

This guide shows how to generate BioFM embeddings for the variants in a VCF file. The embeddings are computed with the method described in our publication.

```python
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Embedder, VCFConverter
import torch

# Define paths to the pre-trained BioFM model and tokenizer
MODEL_PATH = "m42-health/BioFM-265M"
TOKENIZER_PATH = "m42-health/BioFM-265M"

# Load the pre-trained BioFM model and BioToken tokenizer
model = AnnotatedModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
)
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

# Initialize the embedder using the model and tokenizer
embedder = Embedder(model, tokenizer)

# Set up the VCF converter with paths to gene annotations and reference genome
vcf_converter = VCFConverter(
    gene_annotation_path="./gencode.v38.annotation.gff3",
    reference_genome_path="./GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna",
)

# Convert a VCF file into an annotated dataset using BioTokens
annotated_dataset = vcf_converter.vcf_to_annotated_dataset(
    vcf_path="./HG01779_b.vcf.gz",
    max_variants=200,  # Set to None to process all variants in the VCF file
)

# Extract BioFM embeddings for all annotated variants
embeddings = embedder.get_dataset_embeddings(annotated_dataset)
print(embeddings)

# Example output (dict):
# {
#     'embeddings': array of shape (num_variants, 2*embedding_dim),  # Numeric embeddings for each variant
#     'labels': array of shape (num_variants,)  # Present only during supervised embedding extraction
# }
```
- Sample reference genome FASTA file: [download link](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/)
- Gene annotation file: [download link](https://www.gencodegenes.org/human/release_38.html)
- Sample VCF file from 1000 Genomes data: [download link](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/)
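
Once embeddings and labels are available, they can feed a lightweight downstream classifier. The sketch below is illustrative only and is not the evaluation protocol from the publication: it assumes a dict with the `'embeddings'` and `'labels'` keys shown in the example output above, uses toy 4-dimensional vectors in place of real BioFM embeddings, and fits a simple nearest-centroid rule. `fit_centroids` and `predict` are hypothetical helper names, not BioFM-Eval functions.

```python
from collections import defaultdict
from math import dist
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def fit_centroids(
    embeddings: Sequence[Vector], labels: Sequence[Hashable]
) -> Dict[Hashable, List[float]]:
    """Average the embedding vectors of each label into one centroid."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + float(v) for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids: Dict[Hashable, List[float]], vec: Vector) -> Hashable:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], list(vec)))


# Toy stand-in for the dict returned by the embedder (assumed layout).
# With real output, NumPy arrays can be converted with `.tolist()` first.
result = {
    "embeddings": [
        [0.9, 0.1, 0.8, 0.2],
        [1.0, 0.0, 0.9, 0.1],
        [0.1, 0.9, 0.2, 0.8],
        [0.0, 1.0, 0.1, 0.9],
    ],
    "labels": [1, 1, 0, 0],
}

centroids = fit_centroids(result["embeddings"], result["labels"])
print(predict(centroids, [0.8, 0.2, 0.7, 0.3]))
```

In practice a held-out split and a stronger probe (for example logistic regression) would replace the nearest-centroid rule; the point here is only the shape of the data flow from embeddings to predictions.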

### Generation with BioFM
BioFM can generate genomic sequences based on input DNA prompts.

```python
from biofm_eval import AnnotatedModel, AnnotationTokenizer, Generator
import torch

# Define paths to the pre-trained BioFM model and tokenizer
MODEL_PATH = "m42-health/BioFM-265M"
TOKENIZER_PATH = "m42-health/BioFM-265M"

# Load the pre-trained BioFM model and BioToken tokenizer
model = AnnotatedModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
)
tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

# Initialize the generator using the model and tokenizer
seq_generator = Generator(model, tokenizer)

# Generate DNA sequences
input_sequences = ['AGCT', 'GACTGCA']
output = seq_generator.generate(
    input_sequences,
    max_new_tokens=10,
    temperature=1.0,
    do_sample=True,
    top_k=4,
)

print(output)

# Example output: List[str] = ['AGCTACTCCCCTCC', 'GACTGCACCACTGTACT']
```
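
The `temperature`, `do_sample`, and `top_k` arguments appear to follow the usual sampling conventions for causal language models; the internals of `Generator` are not shown in this card, so the following is a sketch of the general technique rather than BioFM-Eval's implementation. It performs temperature-scaled top-k sampling over a toy set of logits. The token set, the logit values, and the `sample_top_k` helper are all illustrative assumptions, and unlike a real model the toy logits do not depend on the preceding context.

```python
import math
import random
from typing import Dict, Optional


def sample_top_k(
    logits: Dict[str, float],
    top_k: int = 4,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token from the `top_k` highest-scoring candidates."""
    if top_k <= 0 or temperature <= 0:
        raise ValueError("top_k and temperature must be positive")
    rng = rng or random.Random()

    # Keep only the k highest-scoring tokens.
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Temperature-scaled softmax weights; subtracting the max keeps exp() stable.
    scaled = [score / temperature for _, score in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    tokens = [tok for tok, _ in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores; "N" is cut off by top_k=4.
toy_logits = {"A": 2.0, "C": 1.5, "G": 0.3, "T": -1.0, "N": -5.0}

rng = random.Random(0)  # fixed seed so repeated runs give the same sequence
prompt = "AGCT"
continuation = "".join(
    sample_top_k(toy_logits, top_k=4, temperature=1.0, rng=rng)
    for _ in range(10)
)
print(prompt + continuation)
```

Lower temperatures concentrate probability on the highest-scoring tokens, while a smaller `top_k` removes low-scoring tokens from consideration entirely.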

## Training Setup

Training was conducted on an NVIDIA DGX cluster with H100 GPUs, using PyTorch's Fully Sharded Data Parallel (FSDP).

## Evaluation Results

To demonstrate the effectiveness of BioToken, we evaluated BioFM against strong supervised baselines: Enformer for gene expression prediction and SpliceTransformer for sQTL prediction.

- *Gene Expression Prediction:* BioFM matches Enformer's performance when both models use a 12K context, making it the first GFM to do so. Notably, Enformer fails to reach this performance level even with a 98K context.
- *sQTL Prediction:* BioFM significantly outperforms SpliceTransformer across all tissues, highlighting its robustness and generalizability.

| sQTL prediction | Expression prediction |
|---------|---------|
| ![image/png](https://cdn-uploads.huggingface.co/production/uploads/657159a2f7ce9db30dcee3fd/PwgQLQV7IIlfAIBzDTkYL.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/657159a2f7ce9db30dcee3fd/7sgpQBgXQ0AELfk2e4gCk.png) |

We further evaluated BioFM on the Variant Benchmark, which we curated, and on the Genomics Long-Range Benchmark.

- *Variant Benchmark:* BioFM outperforms other GFMs across a broad spectrum of [variant prediction tasks](https://huggingface.co/datasets/m42-health/variant-benchmark).
- *Long-Range Genomic Dependencies:* On the Genomics Long-Range Benchmark, BioFM surpasses previous GFMs that required extensive fine-tuning and longer genomic contexts. This highlights BioFM’s ability to capture and use long-range genomic dependencies.

| Variant benchmark | Genomics long-range benchmark |
|---------|---------|
| ![image/png](https://cdn-uploads.huggingface.co/production/uploads/657159a2f7ce9db30dcee3fd/pVdf3hBuiC6lIoMLCS7Xf.png) | ![image/png](https://cdn-uploads.huggingface.co/production/uploads/657159a2f7ce9db30dcee3fd/YCBnZmN0JHT1h1fMuPNMw.png) |

See the [paper](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711) for more results and ablations.

## Citation
```
@article {Medvedev2025.03.27.645711,
    author = {Medvedev, Aleksandr and Viswanathan, Karthik and Kanithi, Praveenkumar and Vishniakov, Kirill and Munjal, Prateek and Christophe, Clement and Pimentel, Marco AF and Rajan, Ronnie and Khan, Shadab},
    title = {BioToken and BioFM - Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models},
    elocation-id = {2025.03.27.645711},
    year = {2025},
    doi = {10.1101/2025.03.27.645711},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711},
    eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711.full.pdf},
    journal = {bioRxiv}
}
```