cshearer committed on
Commit 1762bf2 · verified · 1 Parent(s): 5a02a2c

Upload LOL-EVE production model v1.0 - adaptive embedding refactor

Files changed (1)
  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ license: mit
+ model-index:
+ - name: Marks-lab/LOL-EVE
+   results:
+   - task:
+       type: text-generation
+       name: Genomic Sequence Modeling
+     dataset:
+       type: promoter_sequences
+       name: Mammalian Promoter Sequences
+     metrics:
+     - type: perplexity
+       value: 3.3182
+       name: Validation Perplexity
+   - task:
+       type: variant-effect-prediction
+       name: Promoter Variant Effect Prediction
+     dataset:
+       type: eqtl_benchmark
+       name: Causal eQTL Identification
+     metrics:
+     - type: accuracy
+       value: "State-of-the-art"
+       name: Benchmark Performance
+ ---
+
+ # LOL-EVE: Language Of Life across EVolutionary Effects
+
+ ## Model Description
+
+ LOL-EVE is a conditional autoregressive transformer model trained on 14.6 million diverse mammalian promoter sequences. It leverages evolutionary information and proximal genetic context to predict indel variant effects in human promoter regions.
+
+ ## Architecture
+
+ - **Model Type**: Conditional Autoregressive Transformer
+ - **Base Architecture**: CTRL (Conditional Transformer Language Model)
+ - **Layers**: 12
+ - **Embedding Dimension**: 768
+ - **Attention Heads**: 12
+ - **Max Sequence Length**: 1007
+ - **Position Embedding**: adaptive
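
As a quick sanity check on the hyperparameters listed above, the following sketch collects them in a small dataclass and derives the per-head attention width. The class and field names are hypothetical and are not taken from the LOL-EVE code base; only the four numeric values come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LolEveArchitecture:
    """Hyperparameters copied from the Architecture list (names are hypothetical)."""

    n_layers: int = 12
    d_model: int = 768
    n_heads: int = 12
    max_seq_len: int = 1007

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the embedding evenly across heads,
        # so the embedding dimension must be divisible by the head count.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads


arch = LolEveArchitecture()
print(arch.head_dim)  # 768 / 12 = 64
```

With 768 embedding dimensions and 12 heads, each head works on a 64-dimensional slice.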
45
+
46
+ ## Training Data
47
+
48
+ - **Dataset**: 14.6M mammalian promoter sequences
49
+ - **Species Coverage**: Diverse mammalian species
50
+ - **Sequence Length**: Up to 1000bp promoter regions
51
+ - **Embeddings**: Pre-trained protein embeddings (ESM)
52
+
53
+ ## Performance
54
+
55
+ The model achieves state-of-the-art performance on three key benchmarks:
56
+ 1. **Causal eQTL Identification**: Identifying causal variants in expression quantitative trait loci
57
+ 2. **Rare Variant Prioritization**: Prioritizing rare variants in human population data
58
+ 3. **TFBS Disruption**: Understanding transcription factor binding site disruptions
59
+
60
+ ## Usage
61
+
62
+ ```python
63
+ from transformers import AutoTokenizer, AutoModelForCausalLM
64
+
65
+ # Load tokenizer and model
66
+ tokenizer = AutoTokenizer.from_pretrained("Marks-lab/LOL-EVE")
67
+ model = AutoModelForCausalLM.from_pretrained("Marks-lab/LOL-EVE")
68
+
69
+ # Example sequence
70
+ sequence = "ATGCTAGCTAGCTAGCTAGCTA"
71
+ inputs = tokenizer(sequence, return_tensors="pt")
72
+
73
+ # Generate predictions
74
+ outputs = model(**inputs)
75
+ ```
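
The README does not say how the forward pass is turned into a variant effect score. A common convention for autoregressive models, assumed here and not confirmed by the source, is to compare the log-likelihood of the variant sequence with that of the reference. The sketch below uses plain Python lists so it runs without the model; all function names are hypothetical. With the real model, the inputs would presumably be something like `outputs.logits[0].tolist()` and `inputs["input_ids"][0].tolist()`, assuming the checkpoint returns standard causal-LM outputs.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum of log p(token[t+1] | tokens[:t+1]); logits[t] scores position t + 1."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


def perplexity(logits, token_ids):
    """exp of the mean negative log-likelihood per predicted token."""
    n_predicted = len(token_ids) - 1
    if n_predicted < 1:
        raise ValueError("need at least two tokens")
    return math.exp(-sequence_log_likelihood(logits, token_ids) / n_predicted)


def variant_score(ref_logits, ref_ids, var_logits, var_ids):
    """Assumed scoring rule: log-likelihood of variant minus reference."""
    return (sequence_log_likelihood(var_logits, var_ids)
            - sequence_log_likelihood(ref_logits, ref_ids))


# Toy check: a model that is uniform over 4 nucleotides.
uniform = [0.0, 0.0, 0.0, 0.0]
ref_ids = [0, 1, 2]   # hypothetical token ids for a 3-token reference
var_ids = [0, 2]      # the same sequence with one token deleted
ref_ppl = perplexity([uniform] * 3, ref_ids)
score = variant_score([uniform] * 3, ref_ids, [uniform] * 2, var_ids)
```

For the uniform toy model the perplexity is 4, the chance level for a four-letter alphabet, which gives a reference point for the reported validation perplexity of 3.3182. The toy score is positive only because the deletion leaves fewer tokens to predict, so raw log-likelihood differences between sequences of different lengths need care when scoring indels.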
76
+
77
+ ## Citation
78
+
79
+ If you use this model in your research, please cite:
80
+
81
+ ```bibtex
82
+ @article{loleve2024,
83
+ title={LOL-EVE: Predicting Promoter Variant Effects from Evolutionary Sequences},
84
+ author={[Authors]},
85
+ journal={ICLR 2024},
86
+ year={2024}
87
+ }
88
+ ```
89
+
90
+ ## License
91
+
92
+ This model is licensed under the MIT License.
93
+
94
+ ## Model Details
95
+
96
+ - **Training Framework**: PyTorch Lightning
97
+ - **Optimizer**: Adam with cosine annealing
98
+ - **Learning Rate**: 3e-05
99
+ - **Weight Decay**: 0.01
100
+ - **Batch Size**: 16
101
+ - **Checkpoint**: model_epoch_epoch=01-val_all_control_perplexity_epoch=3.3182.ckpt
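
To illustrate what "cosine annealing" does to the listed learning rate, here is a minimal sketch of the standard closed-form schedule. Only the base rate of 3e-05 comes from the list above; the total step count and the floor `min_lr` are assumptions, and the function name is hypothetical. The README does not say whether warmup or restarts were used.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr=3e-5, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under cosine annealing.

    Follows min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * step / total_steps)).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps stay at the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# total_steps=1000 is an arbitrary value chosen for illustration.
schedule = [cosine_annealed_lr(s, total_steps=1000) for s in (0, 500, 1000)]
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to `min_lr` at the final step.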
102
+
103
+ ## Limitations
104
+
105
+ - Designed specifically for promoter region analysis
106
+ - Requires appropriate genomic context for optimal performance
107
+ - Performance may vary across different species and genomic regions
108
+
109
+ ## Contact
110
+
111
+ For questions about this model, please open an issue in the repository.