## Puncta Prediction Benchmark

This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the **puncta prediction benchmark** (Figure 3).

### From raw data to train/test splits

To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2) Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FODb_SD5.csv`) using `clean.py` in the `puncta` directory.
```
data/
└── raw_data/
    ├── FOdb_puncta.csv
    └── FOdb_SD5.csv
benchmarking/
└── puncta/
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    └── FOdb_physicochemical_embeddings.pkl
```
The `clean.py` script generates the following files:
- **`cleaned_dataset_s4.csv`**: a cleaned version of `FOdb_puncta.csv`, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
- **`splits.csv`**: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as part of either the *train* set ("Expressed_Set" in FOdb) or the *test* set ("Verification_Set" in FOdb). This file also has binary `nucleus`, `cytoplasm`, and `formation` columns. In `nucleus`, 1 = forms a condensate in the nucleus, 0 = does not; in `cytoplasm`, 1 = forms a condensate in the cytoplasm, 0 = does not; in `formation`, 1 = forms a condensate at all, 0 = does not.
- **`FOdb_physicochemical_embeddings.pkl`**: a dictionary where fusion proteins from `splits.csv` are the keys, and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.
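
To illustrate how these files fit together, the sketch below pairs each row of `splits.csv` with its feature vector from the pickled dictionary and separates the train and test sets for one binary task. It is a minimal sketch under stated assumptions, not the code in `train.py`: the key column name `aa_seq`, the literal split values `train`/`test`, and the helper name `build_split_arrays` are all hypothetical.

```python
import csv
import pickle


def build_split_arrays(splits_path, embeddings_path, label_col="formation",
                       key_col="aa_seq"):
    """Return {"train": (X, y), "test": (X, y)} for one binary task.

    ASSUMPTIONS (not confirmed by the repository): `key_col` names the column
    whose values are the dictionary keys, and the `split` column holds the
    literal strings "train" or "test".
    """
    with open(embeddings_path, "rb") as fh:
        embeddings = pickle.load(fh)  # dict: fusion protein -> feature vector

    out = {"train": ([], []), "test": ([], [])}
    with open(splits_path, newline="") as fh:
        for row in csv.DictReader(fh):
            X, y = out[row["split"].strip().lower()]
            X.append(list(embeddings[row[key_col]]))
            # Labels are 1s and 0s; float() tolerates values written as "1.0".
            y.append(int(float(row[label_col])))
    return out
```

Passing `label_col="nucleus"` or `label_col="cytoplasm"` selects one of the other two tasks described above.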

### Training

`config.py` holds the training configurations.
```
# Benchmarking configs
BENCHMARK_FUSONPLM = True  # True if you want to benchmark a FusOn-pLM model

# FUSONPLM_CKPTS: if you've trained your own model, this is a dictionary (key = run name, values = epochs).
# To use the trained FusOn-pLM instead, set FUSONPLM_CKPTS = "FusOn-pLM".
FUSONPLM_CKPTS = {}

# Model comparison configs
BENCHMARK_ESM = True           # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True        # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True  # True if you want to benchmark FO-Puncta-ML from the FOdb paper

# Overwriting configs
PERMISSION_TO_OVERWRITE = False  # if False, the script will halt if it believes these embeddings have already been made

# GPU configs
CUDA_VISIBLE_DEVICES = "0"  # GPUs to make visible for this process
```
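
Since `FUSONPLM_CKPTS` accepts two different shapes (a dictionary of run names to epochs, or the string `"FusOn-pLM"`), it may help to see one way such a value could be normalized before use. The helper below is a hypothetical sketch, not part of the repository; `resolve_checkpoints` and the use of `None` as the epoch for the pretrained model are assumptions made for illustration.

```python
def resolve_checkpoints(ckpts):
    """Expand a FUSONPLM_CKPTS-style value into (run_name, epoch) pairs.

    ASSUMPTION: epoch=None is this sketch's own placeholder for "use the
    trained FusOn-pLM"; the real scripts may represent this differently.
    """
    if isinstance(ckpts, str):
        if ckpts != "FusOn-pLM":
            raise ValueError(f"unrecognized checkpoint name: {ckpts!r}")
        return [("FusOn-pLM", None)]
    if not isinstance(ckpts, dict):
        raise TypeError("FUSONPLM_CKPTS must be a dict or the string 'FusOn-pLM'")

    pairs = []
    for run_name, epochs in ckpts.items():
        # Accept either a single epoch or an iterable of epochs per run.
        if isinstance(epochs, int):
            epochs = [epochs]
        pairs.extend((run_name, int(e)) for e in epochs)
    return pairs
```

For example, `resolve_checkpoints({"my_run": [10, 20]})` returns `[("my_run", 10), ("my_run", 20)]`, and an empty dictionary yields an empty list, meaning no custom checkpoints would be benchmarked.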
<br>

`train.py` trains the XGBoost classifiers.
- All **results** are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training.
- All **embeddings** made for training are stored in a new folder called `puncta/embeddings/`, with a subfolder for each model. This lets you reuse the same model across runs without regenerating its embeddings.
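
The two storage conventions above (a timestamped results folder and a per-model embedding cache) can be sketched as follows. This is an illustrative sketch only: the timestamp format, the cache file name `embeddings.pkl`, and both function names are assumptions, and this version reuses an existing cache when overwriting is not permitted, which may differ from how `train.py` handles `PERMISSION_TO_OVERWRITE`.

```python
import os
import pickle
from datetime import datetime


def make_results_dir(base="results", now=None):
    """Create and return base/<timestamp>. The timestamp format is an assumption."""
    now = now or datetime.now()
    path = os.path.join(base, now.strftime("%Y-%m-%d_%H-%M-%S"))
    os.makedirs(path, exist_ok=True)
    return path


def load_or_make_embeddings(model_name, embed_fn, sequences,
                            base="embeddings", permission_to_overwrite=False):
    """Reuse cached embeddings for `model_name` unless overwriting is permitted."""
    folder = os.path.join(base, model_name)
    cache = os.path.join(folder, "embeddings.pkl")  # hypothetical file name
    if os.path.exists(cache) and not permission_to_overwrite:
        with open(cache, "rb") as fh:
            return pickle.load(fh)

    # No usable cache: compute one embedding per sequence and save them.
    os.makedirs(folder, exist_ok=True)
    embeddings = {seq: embed_fn(seq) for seq in sequences}
    with open(cache, "wb") as fh:
        pickle.dump(embeddings, fh)
    return embeddings
```

Here `embed_fn` stands in for whichever model produces the embeddings (for example ESM-2 or ProtT5); any callable that maps a sequence to a vector works for the purposes of the sketch.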

```
benchmarking/
└── puncta/
    ├── embeddings/
    │   ├── esm2_t33_650M_UR50D/...
    │   ├── fuson_plm/...
    │   └── prot_t5_xl_half_uniref50_enc/...
    └── results/
        └── final/
            ├── figures/
            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
            │   ├── cytoplasm_verificationFOs_barchart.png
            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
            │   ├── formation_verificationFOs_0.83thresh_barchart.png
            │   ├── nucleus_verificationFOs_barchart_source_data.csv
            │   └── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            └── nucleus_verificationFOs_results.csv
```
The following files are in `results/final/figures`:
- **`cytoplasm_verificationFOs_barchart.png`**: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
- **`formation_verificationFOs_0.83thresh_barchart.png`**: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
- **`nucleus_verificationFOs_barchart.png`**: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)

The raw data are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.

If you train a new model, the equivalents of these files will be created in `results/timestamp` for the configurations you set in `config.py`.

To run training, enter in a terminal:
```
python train.py
```
To regenerate the plots, run:
```
python plot.py
```