	---
	license: cc-by-nc-4.0
	tags:
	- m42
	- genomics
	- biology
	- GFM
	- BioFM
	- BioToken
	---

	# BioFM: A Biologically-Informed Genomic Foundation Model
	BioFM is a genomic foundation model designed to address limitations of existing genomic sequence models. It introduces BioToken, a tokenization framework that encodes genomic variants and structural annotations directly into the input, giving the model richer biological context for representation learning.

	![BioFM](figures/biotoken_biofm.png)

	## Model Highlights
	- With BioToken, BioFM achieves competitive genomic prediction results with only 265 million parameters, which reduces computational requirements and training costs.
	- Demonstrates superior performance over specialized models such as Enformer and SpliceTransformer on expression prediction and sQTL prediction, respectively.
	- Excels at genomic tasks that require long-range genomic context (e.g., expression prediction, coding/non-coding pathogenicity prediction, and sQTL prediction), outperforming existing genomic foundation models (GFMs).

	## Model Details
	- Model developers: M42 Health AI Team
	- Base architecture: [MistralForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/mistral#transformers.MistralForCausalLM)
	- Context length:
		- Training: 6k tokens
		- Inference: 12k tokens
	- Training data: 1000 Genomes
	- Input format: Annotated DNA sequences using BioToken
	- Output options:
		- DNA sequences only
		- Embeddings
	- License: CC BY-NC 4.0
	- Publication: [bioRxiv preprint](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711)

	## Model Inference
	We developed BioFM-Eval, a Python package for inference and embedding extraction from genomic sequences. Refer to the [BioFM-Eval](https://github.com/m42-health/biofm-eval/) repository for setup and installation instructions.

	### Creating Variant Embeddings with BioFM

	This guide will help you quickly generate BioFM embeddings for the variants in your VCF file. These embeddings are created using the method described in our publication.

	```python
	from biofm_eval import AnnotatedModel, AnnotationTokenizer, Embedder, VCFConverter
	import torch

	# Define paths to the pre-trained BioFM model and tokenizer
	MODEL_PATH = "m42-health/BioFM-265M"
	TOKENIZER_PATH = "m42-health/BioFM-265M"

	# Load the pre-trained BioFM model and BioToken tokenizer
	model = AnnotatedModel.from_pretrained(
	    MODEL_PATH,
	    torch_dtype=torch.bfloat16,
	)
	tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

	# Initialize the embedder using the model and tokenizer
	embedder = Embedder(model, tokenizer)

	# Set up the VCF converter with paths to gene annotations and reference genome
	vcf_converter = VCFConverter(
	    gene_annotation_path="./gencode.v38.annotation.gff3",
	    reference_genome_path="./GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna",
	)

	# Convert a VCF file into an annotated dataset using BioTokens
	annotated_dataset = vcf_converter.vcf_to_annotated_dataset(
	    vcf_path="./HG01779_b.vcf.gz",
	    max_variants=200,  # Set to None to process all variants in the VCF file
	)

	# Extract BioFM embeddings for all annotated variants
	embeddings = embedder.get_dataset_embeddings(annotated_dataset)
	print(embeddings)

	# Example output (dict):
	# {
	#     'embeddings': array of shape (num_variants, 2*embedding_dim),  # Numeric embeddings for each variant
	#     'labels': array of shape (num_variants,)  # Present only during supervised embedding extraction
	# }

	```
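
As a rough illustration of working with the returned dictionary, the sketch below uses NumPy on a synthetic stand-in for `embeddings['embeddings']` (random values, not real BioFM output). It L2-normalizes the rows and finds each variant's nearest neighbor by cosine similarity. The helper name `nearest_neighbors` and the sizes used here are hypothetical choices for this example and are not part of BioFM-Eval.

```python
import numpy as np


def nearest_neighbors(matrix: np.ndarray) -> np.ndarray:
    """Return, for each row, the index of the most cosine-similar other row."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Exclude self-matches before taking the argmax.
    np.fill_diagonal(similarity, -np.inf)
    return similarity.argmax(axis=1)


# Synthetic stand-in for the dict returned by get_dataset_embeddings
# (assumed sizes: 5 variants, embedding_dim = 8, so 2 * 8 = 16 columns).
rng = np.random.default_rng(0)
fake_output = {"embeddings": rng.normal(size=(5, 16)).astype(np.float32)}

neighbors = nearest_neighbors(fake_output["embeddings"])
print(neighbors.shape)  # (5,)
```

If the real output is not already a NumPy array, it would need to be converted first. The same matrix can also be passed to any standard classifier when `labels` is present.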
	- Sample reference genome FASTA file: [download link](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/)
	- Gene annotation file: [download link](https://www.gencodegenes.org/human/release_38.html)
	- Sample VCF file from 1000 Genomes data: [download link](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/)
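
Before running the converter on your own files, it can help to confirm that the VCF and the reference genome use the same contig naming (for example `chr1` versus `1`), since mismatched names are a common source of failed lookups in VCF tooling in general. Whether `VCFConverter` normalizes contig names is not stated here, so treat this as an optional sanity check. The following is a minimal standard-library sketch and not part of BioFM-Eval; the helper names `vcf_contigs` and `fasta_contigs` are hypothetical.

```python
import gzip
from pathlib import Path


def _open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def vcf_contigs(vcf_path, max_records=1000):
    """Collect CHROM values from the first max_records data lines of a VCF."""
    contigs = set()
    seen = 0
    with _open_text(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):  # skip meta-information and header lines
                continue
            contigs.add(line.split("\t", 1)[0])
            seen += 1
            if seen >= max_records:
                break
    return contigs


def fasta_contigs(fasta_path):
    """Collect sequence names (first word after '>') from a FASTA file."""
    with _open_text(fasta_path) as handle:
        return {
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        }
```

For example, `vcf_contigs("./HG01779_b.vcf.gz") - fasta_contigs(reference_path)` returns the contig names that appear in the sampled VCF records but not in the reference; an empty set means the names line up. Scanning a full human reference for header lines reads the whole file, so it may take a little while.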


	### Generation with BioFM
	BioFM can generate genomic sequences based on input DNA prompts.

	```python
	from biofm_eval import AnnotatedModel, AnnotationTokenizer, Generator
	import torch

	# Define paths to the pre-trained BioFM model and tokenizer
	MODEL_PATH = "m42-health/BioFM-265M"
	TOKENIZER_PATH = "m42-health/BioFM-265M"

	# Load the pre-trained BioFM model and BioToken tokenizer
	model = AnnotatedModel.from_pretrained(
	    MODEL_PATH,
	    torch_dtype=torch.bfloat16,
	)
	tokenizer = AnnotationTokenizer.from_pretrained(TOKENIZER_PATH)

	# Initialize the generator using the model and tokenizer
	seq_generator = Generator(model, tokenizer)

	# Generate DNA sequences
	input_sequences = ['AGCT', 'GACTGCA']
	output = seq_generator.generate(
	    input_sequences,
	    max_new_tokens=10,
	    temperature=1.0,
	    do_sample=True,
	    top_k=4,
	)

	print(output)

	# Example output: List[str] = ['AGCTACTCCCCTCC', 'GACTGCACCACTGTACT']

	```
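
The sampling arguments in the call above (`temperature`, `do_sample`, `top_k`) share their names with the usual Hugging Face `generate` arguments. Assuming `Generator.generate` forwards them with the usual semantics, the toy sketch below shows what top-k sampling with temperature does at a single decoding step. It uses only the standard library, a made-up four-letter vocabulary, and made-up logits; BioFM's real vocabulary presumably also contains BioToken annotation tokens, and the function names `top_k_distribution` and `sample_top_k` are hypothetical.

```python
import math
import random


def top_k_distribution(logits, top_k, temperature=1.0):
    """Return {token: probability} after top-k filtering and temperature scaling."""
    # Keep only the top_k highest-scoring tokens.
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the kept logits, scaled by temperature
    # (the maximum is subtracted for numerical stability).
    scaled = [score / temperature for _, score in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {token: w / total for (token, _), w in zip(kept, weights)}


def sample_top_k(logits, top_k, temperature=1.0, rng=random):
    """Draw one token from the filtered distribution."""
    dist = top_k_distribution(logits, top_k, temperature)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Made-up logits for a single decoding step over a toy nucleotide vocabulary.
toy_logits = {"A": 2.0, "C": 1.0, "G": 0.5, "T": -1.0}

print(top_k_distribution(toy_logits, top_k=2))
print(sample_top_k(toy_logits, top_k=4, rng=random.Random(0)))
```

In this toy setting, `top_k=4` keeps every token of the four-letter vocabulary, so only the temperature shapes the distribution; smaller values of `top_k` restrict sampling to the most likely tokens, and `top_k=1` reduces to greedy decoding.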

	## Training Setup

	BioFM was trained on an NVIDIA DGX cluster with H100 GPUs using PyTorch's Fully Sharded Data Parallel (FSDP).

	## Evaluation Results

	To demonstrate the effectiveness of BioToken, we evaluated BioFM against strong supervised baselines: Enformer for gene expression prediction and SpliceTransformer for sQTL prediction.

	- Gene Expression Prediction: BioFM matches Enformer's performance when both models use a 12K context, making it the first GFM to do so. Notably, Enformer fails to reach this performance level even with a 98K context.
	- sQTL Prediction: BioFM significantly outperforms SpliceTransformer across all tissues, highlighting its robustness and generalizability.

	| sQTL prediction | Expression prediction |
	|---------|---------|
	| ![sQTL prediction model comparison](figures/sqtl_model_comparison.png) | ![Expression prediction model comparison](figures/expression_model_comparison.png) |

	We further evaluated BioFM on the Variant Benchmark we curated and the Genomics Long-Range Benchmark.

	- Variant Benchmark: Across a broad spectrum of [variant prediction tasks](https://huggingface.co/datasets/m42-health/variant-benchmark), BioFM outperforms other GFMs, showcasing its superior predictive capabilities.
	- Long-Range Genomic Dependencies: On the Genomics Long-Range Benchmark, BioFM sets new performance standards, surpassing previous GFMs that required extensive fine-tuning and longer genomic contexts. This highlights BioFM’s ability to effectively capture and utilize long-range genomic dependencies.

	| Variant benchmark | Genomics long-range benchmark |
	|---------|---------|
	| ![Variant benchmark results](figures/vb_heatmap_and_barh_max.png) | ![Genomics long-range benchmark results](figures/nt_lr_heatmap_and_barh.png) |

	See the [paper](https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711) for more results and ablations.

	## Citation
	```
	@article {Medvedev2025.03.27.645711,
	author = {Medvedev, Aleksandr and Viswanathan, Karthik and Kanithi, Praveenkumar and Vishniakov, Kirill and Munjal, Prateek and Christophe, Clement and Pimentel, Marco AF and Rajan, Ronnie and Khan, Shadab},
	title = {BioToken and BioFM - Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models},
	elocation-id = {2025.03.27.645711},
	year = {2025},
	doi = {10.1101/2025.03.27.645711},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711},
	eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.03.27.645711.full.pdf},
	journal = {bioRxiv}
	}
	```