|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- genomics |
|
|
- dna |
|
|
- language-model |
|
|
- causal-lm |
|
|
- biology |
|
|
- sequence-modeling |
|
|
- variant-prediction |
|
|
- promoter |
|
|
- indel |
|
|
- eqtl |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# LOL-EVE: A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects |
|
|
|
|
|
## Model Description |
|
|
|
|
|
LOL-EVE is a transformer-based model that processes DNA sequences with control codes to predict variant effects. The model was trained on 13.6 million mammalian promoter sequences and demonstrates state-of-the-art performance on promoter indel prediction tasks. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Large vocabulary**: 39,378 tokens including DNA bases, control codes, and special tokens |
|
|
- **Control code integration**: Incorporates gene, species, and clade information |
|
|
- **Protein context**: Uses pre-trained ESM embeddings for gene-specific understanding |
|
|
- **Flexible input format**: Supports both basic DNA sequences and control code sequences |
|
|
- **Zero-shot prediction**: Enables prediction of indel effects without task-specific training |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('Marks-lab/LOL-EVE')
model = AutoModelForCausalLM.from_pretrained('Marks-lab/LOL-EVE', trust_remote_code=True)

# Basic DNA sequence ([MASK] placeholders stand in for the gene, species, and clade control codes)
sequence = "[MASK] [MASK] [MASK] [SOS]ATGCTAGCTAGCTAGCTAGCTA[EOS]"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
```
|
|
|
|
|
### With Control Codes (Recommended) |
|
|
|
|
|
```python
# Control code sequence (recommended)
control_sequence = "brca1 human primate [SOS] ATGCTAGCTAGCTAGCTAGCTA [EOS]"
inputs = tokenizer(control_sequence, return_tensors="pt")
outputs = model(**inputs)
```
|
|
|
|
|
### Variant Scoring |
|
|
|
|
|
```python
import pandas as pd
import torch

def score_variants_hf(variants_df, gene, species, clade):
    """
    Score variants using the Hugging Face model.

    Args:
        variants_df: DataFrame with columns ['sequence', 'variant_sequence']
        gene: Gene name (e.g., 'brca1')
        species: Species name (e.g., 'human')
        clade: Clade information (e.g., 'primate')

    Returns:
        DataFrame with added 'score' column
    """
    scores = []

    for _, row in variants_df.iterrows():
        # Create control code sequences
        ref_seq = f"{gene} {species} {clade} [SOS] {row['sequence']} [EOS]"
        var_seq = f"{gene} {species} {clade} [SOS] {row['variant_sequence']} [EOS]"

        # Tokenize sequences
        ref_inputs = tokenizer(ref_seq, return_tensors="pt")
        var_inputs = tokenizer(var_seq, return_tensors="pt")

        # Get model outputs
        with torch.no_grad():
            ref_outputs = model(**ref_inputs)
            var_outputs = model(**var_inputs)

        # Shift logits and targets for next-token prediction
        ref_logits = ref_outputs.logits[0, :-1]  # Exclude last position
        var_logits = var_outputs.logits[0, :-1]

        ref_tokens = ref_inputs['input_ids'][0, 1:]  # Exclude first token
        var_tokens = var_inputs['input_ids'][0, 1:]

        # Summed negative log-likelihood of each sequence
        ref_score = torch.nn.functional.cross_entropy(ref_logits, ref_tokens, reduction='sum')
        var_score = torch.nn.functional.cross_entropy(var_logits, var_tokens, reduction='sum')

        # Score is the difference in negative log-likelihood
        # (higher = variant less likely than reference, i.e., more deleterious)
        score = (var_score - ref_score).item()
        scores.append(score)

    variants_df['score'] = scores
    return variants_df

# Example usage (one deletion and one insertion relative to the reference)
variants = pd.DataFrame({
    'sequence': ['ATGCTAGCTAGCTAGCTAGCTA', 'ATGCTAGCTAGCTAGCTAGCTA'],
    'variant_sequence': ['ATGCTAGCTAGCTA', 'ATGCTAGGGGCTAGCTAGCTAGCTA']
})

scored_variants = score_variants_hf(variants, gene='brca1', species='human', clade='primate')
print(scored_variants)
```
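The score is a difference of summed negative log-likelihoods, so its sign can be sanity-checked with plain arithmetic. A minimal illustration (the per-token probabilities below are made up for demonstration, not model outputs):

```python
import math

# Hypothetical per-token probabilities a model might assign to each token
# of a reference and a variant sequence (illustrative values only)
ref_probs = [0.9, 0.8, 0.85]
var_probs = [0.9, 0.3, 0.85]  # the variant makes the second token less likely

# Summed negative log-likelihood, as computed by cross_entropy(reduction='sum')
ref_nll = -sum(math.log(p) for p in ref_probs)
var_nll = -sum(math.log(p) for p in var_probs)

# Positive score: the variant sequence is less likely than the reference,
# which is read as more deleterious
score = var_nll - ref_nll
print(round(score, 3))
```

Only the token made less likely by the variant contributes to the score; tokens with unchanged probability cancel out in the difference.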
|
|
|
|
|
### Input Format |
|
|
|
|
|
The model expects sequences in the format: |
|
|
``` |
|
|
gene species clade [SOS] sequence [EOS] |
|
|
``` |
|
|
|
|
|
Where: |
|
|
- `gene`: Gene name (e.g., "brca1", "tp53") |
|
|
- `species`: Species name (e.g., "human", "mouse") |
|
|
- `clade`: Clade information (e.g., "primate", "mammal") |
|
|
- `[SOS]`: Start of sequence token |
|
|
- `sequence`: DNA sequence (A, T, G, C) |
|
|
- `[EOS]`: End of sequence token |
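The format above can be assembled with a small helper. This is a convenience sketch, not part of the released package; `format_input` is a hypothetical name:

```python
def format_input(gene: str, species: str, clade: str, sequence: str) -> str:
    """Build a LOL-EVE input string: 'gene species clade [SOS] sequence [EOS]'."""
    sequence = sequence.upper()
    if set(sequence) - set("ATGC"):
        raise ValueError("sequence must contain only A, T, G, C")
    return f"{gene} {species} {clade} [SOS] {sequence} [EOS]"

print(format_input("brca1", "human", "primate", "ATGCTAGC"))
# brca1 human primate [SOS] ATGCTAGC [EOS]
```

The resulting string can be passed directly to the tokenizer as in the examples above.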
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Model type**: Causal Language Model (CTRL-based) |
|
|
- **Layers**: 12 transformer layers |
|
|
- **Hidden size**: 768 dimensions |
|
|
- **Attention heads**: 12 |
|
|
- **Vocabulary size**: 39,378 tokens |
|
|
- **Max sequence length**: 1,007 tokens |
|
|
- **Position embeddings**: Adaptive local position embeddings |
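For a rough sense of scale, the listed dimensions imply a parameter count on the order of 100M. The back-of-envelope estimate below assumes a standard transformer block with 4x MLP expansion and tied input/output embeddings; these are assumptions for illustration, not confirmed details of LOL-EVE:

```python
# Back-of-envelope parameter estimate from the listed architecture
# (assumes 4x MLP expansion and tied embeddings; biases/layernorms omitted)
vocab_size, hidden, n_layers = 39378, 768, 12

embedding_params = vocab_size * hidden   # token embedding matrix
attn_params = 4 * hidden * hidden        # Q, K, V, and output projections
mlp_params = 2 * hidden * (4 * hidden)   # up- and down-projection
per_layer = attn_params + mlp_params

total = embedding_params + n_layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")
```

Note that the large vocabulary means the embedding matrix alone accounts for roughly a quarter of the estimated total.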
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on genomic sequences with: |
|
|
- DNA sequences up to 1000 base pairs |
|
|
- Gene-specific control codes |
|
|
- Species and clade information |
|
|
- Pre-trained ESM protein embeddings |
|
|
- 13.6 million mammalian promoter sequences |
|
|
|
|
|
## Performance |
|
|
|
|
|
LOL-EVE demonstrates state-of-the-art performance on: |
|
|
|
|
|
### Benchmarks |
|
|
- **Ultra-rare variant prioritization**: ranking ultra-rare variants in gnomAD
|
|
- **Causal eQTL identification**: Identifying causal expression quantitative trait loci |
|
|
- **Transcription factor binding site disruption**: Analyzing TFBS disruption by indels |
|
|
|
|
|
|
|
|
## Datasets |
|
|
|
|
|
- **[LOL-EVE-UltraRare](https://huggingface.co/datasets/Marks-lab/LOL-EVE-UltraRare)** - Ultra-rare variant benchmark dataset |
|
|
- **[LOL-EVE-eQTL_benchmark](https://huggingface.co/datasets/Marks-lab/LOL-EVE-eQTL_benchmark)** - eQTL benchmark dataset |
|
|
- **[PromoterZoo Training Data](https://huggingface.co/datasets/Marks-lab/PromoterZoo/blob/main/README.md)** - Mammalian promoter sequences used to train LOL-EVE
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use LOL-EVE in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{loleve2025, |
|
|
title={A Genomic Language Model for Zero-Shot Prediction of Promoter Variant Effects}, |
|
|
author={[Authors]}, |
|
|
journal={MLCB 2025}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT License. See the [LICENSE](https://github.com/Marks-lab/LOL-EVE/blob/main/LICENSE) file for more details. |
|
|
|
|
|
## Repository |
|
|
|
|
|
- **GitHub**: [https://github.com/debbiemarkslab/LOL-EVE](https://github.com/debbiemarkslab/LOL-EVE) |
|
|
- **Paper**: [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.11.11.623015v1) (MLCB 2025 version forthcoming; link to be updated)
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/Marks-lab/LOL-EVE).
|
|
|