Upload LOL-EVE production model v1.0 - adaptive embedding refactor


Files changed (1)

README.md +111 -3

README.md CHANGED

@@ -1,3 +1,111 @@
----
-license: mit
----

+---
+language:
+- en
+license: mit
+model-index:
+- name: Marks-lab/LOL-EVE
+  results:
+  - task:
+      type: text-generation
+      name: Genomic Sequence Modeling
+    dataset:
+      type: promoter_sequences
+      name: Mammalian Promoter Sequences
+    metrics:
+      - type: perplexity
+        value: 3.3182
+        name: Validation Perplexity
+  - task:
+      type: variant-effect-prediction
+      name: Promoter Variant Effect Prediction
+    dataset:
+      type: eqtl_benchmark
+      name: Causal eQTL Identification
+    metrics:
+      - type: accuracy
+        value: "State-of-the-art"
+        name: Benchmark Performance
+---
+# LOL-EVE: Language Of Life across EVolutionary Effects
+## Model Description
+LOL-EVE is a conditional autoregressive transformer trained on 14.6 million diverse mammalian promoter sequences. It uses evolutionary information and proximal genetic context to predict the effects of insertion and deletion (indel) variants in human promoter regions.
+## Architecture
+- **Model Type**: Conditional Autoregressive Transformer
+- **Base Architecture**: CTRL (Conditional Transformer Language Model)
+- **Layers**: 12
+- **Embedding Dimension**: 768
+- **Attention Heads**: 12
+- **Max Sequence Length**: 1007
+- **Position Embedding**: adaptive
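Note on the architecture list above: the hyperparameters can be gathered into a small config object to sanity-check them. The sketch below is only illustrative. The class name `LolEveConfigSketch` and its field names are hypothetical and are not the repository's actual config class, and the weight count assumes standard multi-head attention with square projection matrices, which the README does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LolEveConfigSketch:  # hypothetical name, not the repository's config class
    n_layers: int = 12       # "Layers" in the README
    d_model: int = 768       # "Embedding Dimension"
    n_heads: int = 12        # "Attention Heads"
    max_seq_len: int = 1007  # "Max Sequence Length"

    @property
    def head_dim(self) -> int:
        # Each head gets an equal slice of the embedding dimension.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

    @property
    def attention_weights_per_layer(self) -> int:
        # Assumption: Q, K, V and output projections are each
        # d_model x d_model (biases ignored).
        return 4 * self.d_model * self.d_model


cfg = LolEveConfigSketch()
print(cfg.head_dim, cfg.attention_weights_per_layer)
```

With the listed values, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional embedding.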
+## Training Data
+- **Dataset**: 14.6M mammalian promoter sequences
+- **Species Coverage**: Diverse mammalian species
+- **Sequence Length**: Up to 1000bp promoter regions
+- **Embeddings**: Pre-trained protein embeddings (ESM)
+## Performance
+The model achieves state-of-the-art performance on three key benchmarks:
+1. **Causal eQTL Identification**: Identifying causal variants in expression quantitative trait loci
+2. **Rare Variant Prioritization**: Prioritizing rare variants in human population data
+3. **TFBS Disruption**: Understanding transcription factor binding site disruptions
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+# Load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("Marks-lab/LOL-EVE")
+model = AutoModelForCausalLM.from_pretrained("Marks-lab/LOL-EVE")
+# Example sequence
+sequence = "ATGCTAGCTAGCTAGCTAGCTA"
+inputs = tokenizer(sequence, return_tensors="pt")
+# Run a forward pass (this scores the input sequence; it does not generate new tokens)
+outputs = model(**inputs)
+```
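Note on the usage snippet above: the README does not say how the forward-pass outputs become a variant effect score. A common approach for autoregressive sequence models is a log-likelihood ratio between the variant and reference sequences, and the sketch below shows that calculation on toy data. Everything here is an assumption for illustration: the four-letter vocabulary, the `toy_logits` stand-in for the model, and the function names are hypothetical and are not LOL-EVE's tokenizer or scoring code.

```python
import math

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical toy vocabulary


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def sequence_log_likelihood(logits, token_ids):
    """Sum of log P(token at t+1 | tokens up to t).

    logits[t] is the prediction for position t + 1, so the first token
    is not scored and the last row of logits is unused.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


def toy_logits(token_ids):
    """Stand-in for a real model: favors repeating the current token."""
    return [[2.0 if k == tok else 0.0 for k in range(len(VOCAB))]
            for tok in token_ids]


def encode(seq):
    return [VOCAB[base] for base in seq]


ref_ids = encode("AAAA")   # reference sequence
alt_ids = encode("AACAA")  # same sequence with a one-base insertion

ref_ll = sequence_log_likelihood(toy_logits(ref_ids), ref_ids)
alt_ll = sequence_log_likelihood(toy_logits(alt_ids), alt_ids)

# Log-likelihood ratio: negative means the variant is less likely
# than the reference under the (toy) model.
score = alt_ll - ref_ll
print(f"variant score: {score:.4f}")
```

With a real model, `toy_logits` would be replaced by the logits returned from the forward pass. Because an indel changes the sequence length, the two sums cover different numbers of tokens, so any length normalization is a design choice this sketch leaves open.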
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@article{loleve2024,
+  title={LOL-EVE: Predicting Promoter Variant Effects from Evolutionary Sequences},
+  author={[Authors]},
+  journal={ICLR 2024},
+  year={2024}
+}
+```
+## License
+This model is licensed under the MIT License.
+## Model Details
+- **Training Framework**: PyTorch Lightning
+- **Optimizer**: Adam with a cosine-annealing learning-rate schedule
+- **Learning Rate**: 3e-05
+- **Weight Decay**: 0.01
+- **Batch Size**: 16
+- **Checkpoint**: model_epoch_epoch=01-val_all_control_perplexity_epoch=3.3182.ckpt
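Note on the checkpoint name above: the validation perplexity of 3.3182 is easier to interpret as a per-token loss. Perplexity is the exponential of the mean negative log-likelihood, so the sketch below converts the reported value back to nats and bits and compares it with a uniform guess over four nucleotides. It assumes the perplexity was computed per token with a natural-log loss and that tokens are single nucleotides. The README confirms neither, and the real vocabulary may include additional control tokens.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


reported_ppl = 3.3182                  # value from the checkpoint filename
nll_nats = math.log(reported_ppl)      # mean per-token loss in nats
nll_bits = nll_nats / math.log(2)      # the same quantity in bits

# Baseline: a model that assigns probability 1/4 to every nucleotide
# has a per-token loss of log(4) nats, i.e. perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 10)

print(f"{nll_nats:.3f} nats, {nll_bits:.3f} bits, uniform baseline {uniform_ppl:.1f}")
```

Under these assumptions the reported perplexity corresponds to about 1.20 nats (about 1.73 bits) per token, compared with 2 bits per token for the uniform baseline.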
+## Limitations
+- Designed specifically for promoter region analysis
+- Requires appropriate genomic context for optimal performance
+- Performance may vary across different species and genomic regions
+## Contact
+For questions about this model, please open an issue in the repository.