# Fine-Tuning RNA Language Models to Predict Branch Points

This repository contains several fine-tuned RNA language models for predicting branch points within intronic sequences. The models are fine-tuned using the [MultiMolecule library](https://multimolecule.danling.org/) and evaluated on experimental datasets.

The following RNA language models were fine-tuned:

- SpliceBERT
- RNABERT
- RNA-FM
- RNA-MSM
- ERNIE-RNA
- UTR-LM

The dataset contains **177,980 samples** and is an experimental-data-only subset of the dataset used to train [BPHunter](https://www.pnas.org/doi/abs/10.1073/pnas.2211194119?url_ver=Z39.88-2003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub++0pubmed).

It has been split into approximately **80/10/10 train/validation/test** by chromosome:

- Train: `chr1`, `chr2`, `chr3`, `chr4`, `chr5`, `chr6`, `chr7`, `chr12`, `chr13`, `chr14`, `chr15`, `chr16`, `chr17`, `chr18`, `chr19`, `chr20`, `chr21`, `chr22`, `chrX`, `chrY`
- Validation: `chr9`, `chr10`
- Test: `chr8`, `chr11`
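The chromosome-based split above can be sketched as a simple filter. This is an illustrative sketch, not the repository's actual preprocessing code: the sample layout (a dict with a `chrom` field) and the function names are assumptions.

```python
# Assign each sample to a split based on its chromosome.
# NOTE: the sample structure ({"chrom": ...}) is illustrative;
# the actual dataset format may differ.

VALIDATION_CHROMS = {"chr9", "chr10"}
TEST_CHROMS = {"chr8", "chr11"}

def assign_split(chrom: str) -> str:
    """Map a chromosome name to its train/validation/test split."""
    if chrom in VALIDATION_CHROMS:
        return "validation"
    if chrom in TEST_CHROMS:
        return "test"
    # All remaining chromosomes (chr1-chr7, chr12-chr22, chrX, chrY)
    return "train"

def split_dataset(samples):
    """Group samples into the three splits by chromosome."""
    splits = {"train": [], "validation": [], "test": []}
    for sample in samples:
        splits[assign_split(sample["chrom"])].append(sample)
    return splits
```

Splitting by chromosome rather than at random is a common way to limit sequence leakage between the training and evaluation splits.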
## Training Details

Each model was trained on the full dataset for 3 epochs with a batch size of 16, except for RNA-FM, which required a reduced batch size of 12 due to VRAM limitations. The following hyperparameters were used for most models, including RNABERT, RNA-FM, RNA-MSM, and UTR-LM:

- Optimizer: AdamW
- Learning rate: 3e-4
- Weight decay: 0.001

However, SpliceBERT and ERNIE-RNA failed to converge with these parameters. To address this, we adjusted the hyperparameters to:

- Learning rate: 2e-5
- Weight decay: 0.01

These adjustments were made based on empirical observations during early training. Ideally, comprehensive hyperparameter tuning would be performed for each model to optimize performance, but this was not feasible within the scope of the project due to the high computational cost and training time required.
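The per-model settings above can be collected into a small lookup table. A minimal sketch, assuming the model names are used as plain string keys (the helper and constant names are illustrative, not part of the repository's code):

```python
# Training hyperparameters as described above.
# Defaults: AdamW, lr 3e-4, weight decay 0.001, batch size 16.
# SpliceBERT and ERNIE-RNA use lr 2e-5 and weight decay 0.01;
# RNA-FM uses batch size 12 due to VRAM limits.

DEFAULTS = {"optimizer": "AdamW", "lr": 3e-4, "weight_decay": 0.001, "batch_size": 16}

OVERRIDES = {
    "SpliceBERT": {"lr": 2e-5, "weight_decay": 0.01},
    "ERNIE-RNA": {"lr": 2e-5, "weight_decay": 0.01},
    "RNA-FM": {"batch_size": 12},
}

def hyperparameters(model_name: str) -> dict:
    """Return the training configuration for a given model."""
    return {**DEFAULTS, **OVERRIDES.get(model_name, {})}
```

Keeping the overrides in one table makes it easy to see at a glance which models deviate from the shared defaults.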
## GitHub

All code used to create and evaluate these models can be found at [this link](https://github.com/AliSaadatV/BP_LM/tree/main).