---
license: mit
tags:
- genomics
- dna
- language-model
- causal-lm
- biology
- sequence-modeling
- variant-prediction
- promoter
- indel
- eqtl
pipeline_tag: text-generation
library_name: transformers
---
# LOL-EVE: A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects
## Model Description
LOL-EVE is a transformer-based model that processes DNA sequences with control codes to predict variant effects. The model was trained on 13.6 million mammalian promoter sequences and demonstrates state-of-the-art performance on promoter indel prediction tasks.
### Key Features
- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens
- **Control code integration**: Incorporates gene, species, and clade information
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding
- **Flexible input format**: Supports both basic DNA sequences and control code sequences
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training
## Usage
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

# Basic DNA sequence (control codes replaced by [MASK] tokens)
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
### With Control Codes (Recommended)
```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```
### Variant Scoring
```python
import pandas as pd
import torch

def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.

    Assumes `tokenizer` and `model` have been loaded as shown above.

    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')

    Returns:
        DataFrame with added 'score' column
    """
    scores = []
    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"

        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")

        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)

        # Shift logits and labels for next-token prediction
        ref_logits = ref_outputs.logits[0, :-1]  # Exclude last position
        var_logits = var_outputs.logits[0, :-1]
        ref_tokens = ref_inputs['input_ids'][0, 1:]  # Exclude first token
        var_tokens = var_inputs['input_ids'][0, 1:]

        # Summed negative log-likelihood of each sequence
        ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
        var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')

        # Score is the difference (higher = more deleterious)
        scores.append((var_score - ref_score).item())

    variants_df['score'] = scores
    return variants_df

# Example usage: the first row is unchanged (score near zero),
# the second carries a single-base deletion
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGTA']
})
scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```
### Input Format
The model expects sequences in the format:
```
gene species clade [SOS] sequence [EOS]
```
Where:
- `gene`: Gene name (e.g., "brca1", "tp53")
- `species`: Species name (e.g., "human", "mouse")
- `clade`: Clade information (e.g., "primate", "mammal")
- `[SOS]`: Start of sequence token
- `sequence`: DNA sequence (A, T, G, C)
- `[EOS]`: End of sequence token
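The format above can be assembled with a small helper. This is a hypothetical convenience function, not part of the released package; the validation step is an assumption about well-formed inputs:

```python
def format_input(gene: str, species: str, clade: str, sequence: str) -> str:
    """Assemble a LOL-EVE input string: control codes followed by the
    DNA sequence wrapped in [SOS]/[EOS] markers."""
    # Reject characters outside the A/T/G/C alphabet (assumed constraint)
    extra = set(sequence.upper()) - {"A", "T", "G", "C"}
    if extra:
        raise ValueError(f"non-ACGT characters in sequence: {sorted(extra)}")
    return f"{gene} {species} {clade} [SOS] {sequence} [EOS]"

print(format_input("tp53", "human", "primate", "ATGC"))
# tp53 human primate [SOS] ATGC [EOS]
```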
## Model Architecture
- **Model type**: Causal Language Model (CTRL-based)
- **Layers**: 12 transformer layers
- **Hidden size**: 768 dimensions
- **Attention heads**: 12
- **Vocabulary size**: 39,378 tokens
- **Max sequence length**: 1,007 tokens
- **Position embeddings**: Adaptive local position embeddings
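A rough back-of-the-envelope parameter estimate from the numbers above, assuming a standard transformer block with a 4x feed-forward expansion (the real model may differ, and biases, layer norms, position embeddings, and the ESM projection are omitted):

```python
vocab, d, layers = 39_378, 768, 12

embedding = vocab * d          # token embedding matrix
attention = 4 * d * d          # Q, K, V, and output projections
ffn = 2 * d * (4 * d)          # two linear layers, 4x expansion
total = embedding + layers * (attention + ffn)

print(f"~{total / 1e6:.0f}M parameters")  # ~115M parameters
```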
## Training Data
The model was trained on genomic sequences with:
- DNA sequences up to 1000 base pairs
- Gene-specific control codes
- Species and clade information
- Pre-trained ESM protein embeddings
- 13.6 million mammalian promoter sequences
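Since training sequences were capped at 1,000 bp, longer promoters would need truncation before scoring. A minimal sketch, assuming the bases nearest the 3' end of the window should be kept (the actual preprocessing pipeline may anchor the window differently):

```python
MAX_BP = 1000

def truncate_promoter(seq: str, max_bp: int = MAX_BP) -> str:
    """Keep at most max_bp bases, retaining the 3'-most portion."""
    return seq[-max_bp:] if len(seq) > max_bp else seq

print(len(truncate_promoter("A" * 1500)))  # 1000
```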
## Performance
LOL-EVE demonstrates state-of-the-art performance on:
### Benchmarks
- **Ultra-rare variant prioritization**: Prioritizing ultra-rare variants in gnomAD
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels
## Datasets
- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset
- **[PromoterZoo Training Data](https://huggingface.co/datasets/Marks-lab/PromoterZoo/blob/main/README.md)** - Mammalian promoter sequences used for training
## Citation
If you use LOL-EVE in your research, please cite:
```bibtex
@article{loleve2025,
  title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects},
  author={[Authors]},
  journal={MLCB 2025},
  year={2025}
}
```
## License
This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details.
## Repository
- **GitHub**: [https://github.com/debbiemarkslab/LOL-EVE](https://github.com/debbiemarkslab/LOL-EVE)
- **Paper**: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.11.11.623015v1) (MLCB 2025 version forthcoming; link to be updated)
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).