## Puncta Prediction Benchmark

This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the **puncta prediction benchmark** (Figure 3).

### From raw data to train/test splits

To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2), Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FOdb_SD5.csv`), using the file `clean.py` in the `puncta` directory.

```
data/
└── raw_data/
    ├── FOdb_puncta.csv
    ├── FOdb_SD5.csv
benchmarking/
└── puncta/
    ├── clean.py
    ├── cleaned_dataset_s4.csv
    ├── splits.csv
    ├── FOdb_physicochemical_embeddings.pkl
```

The `clean.py` script generates the following files:
- **`cleaned_dataset_s4.csv`**: cleaned version of `FOdb_puncta.csv`, in which fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
- **`splits.csv`**: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as part of either the *train* set ("Expressed_Set" in FOdb) or the *test* set ("Verification_Set" in FOdb). This dataset also has `nucleus`, `cytoplasm`, and `formation` columns of 1s and 0s. In `nucleus`, 1=forms a condensate in the nucleus, 0=does not; in `cytoplasm`, 1=forms a condensate in the cytoplasm, 0=does not; in `formation`, 1=forms a condensate at all, 0=does not.
- **`FOdb_physicochemical_embeddings.pkl`**: a dictionary in which the fusion proteins from `splits.csv` are the keys and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.

### Training

`config.py` holds the training configurations.

```
# Benchmarking configs
BENCHMARK_FUSONPLM = True   # True if you want to benchmark a FusOn-pLM model

# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM instead, set FUSONPLM_CKPTS="FusOn-pLM"
FUSONPLM_CKPTS= {}

# Model comparison configs
BENCHMARK_ESM = True            # True if you want to benchmark ESM-2-650M
BENCHMARK_PROTT5 = True         # True if you want to benchmark ProtT5
BENCHMARK_FO_PUNCTA_ML = True   # True if you want to benchmark FO-Puncta-ML from the FOdb paper

# Overwriting configs
PERMISSION_TO_OVERWRITE = False # if False, script will halt if it believes these embeddings have already been made.

# GPU configs
CUDA_VISIBLE_DEVICES="0"        # GPUs to make visible for this process
```
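
The two forms of `FUSONPLM_CKPTS` and the effect of `PERMISSION_TO_OVERWRITE` can be illustrated with a small sketch. This is not the code in `train.py`: the helper names `expand_checkpoints` and `may_write` are hypothetical, and the sketch assumes that each dictionary value is either a single epoch or a list of epochs.

```python
from pathlib import Path


def expand_checkpoints(ckpts):
    """Turn a FUSONPLM_CKPTS-style value into a flat list of (run, epoch) pairs.

    Assumed shapes (not confirmed by the repository):
      - the string "FusOn-pLM" selects the released model, with no epoch
      - a dict maps run name -> one epoch or a list of epochs
    """
    if isinstance(ckpts, str):
        return [(ckpts, None)]
    pairs = []
    for run_name, epochs in ckpts.items():
        if isinstance(epochs, int):
            epochs = [epochs]  # allow a single epoch as shorthand
        pairs.extend((run_name, epoch) for epoch in epochs)
    return pairs


def may_write(path, permission_to_overwrite):
    """Return True if writing to `path` respects the overwrite setting."""
    return permission_to_overwrite or not Path(path).exists()


print(expand_checkpoints("FusOn-pLM"))           # [('FusOn-pLM', None)]
print(expand_checkpoints({"my_run": [10, 20]}))  # [('my_run', 10), ('my_run', 20)]
```

Under these assumptions, the default `FUSONPLM_CKPTS = {}` expands to an empty list, so no custom checkpoint would be benchmarked until a run name is added.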

`train.py` will train the XGBoost classifiers.
- All **results** are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training.
- All **embeddings** made for training are stored in a new folder called `puncta/embeddings/`, with a subfolder for each model. This allows you to reuse the same model multiple times without regenerating embeddings.

```
benchmarking/
└── puncta/
    └── embeddings/
        └── esm2_t33_650M_UR50D/...
        └── fuson_plm/...
        └── prot_t5_xl_half_uniref50_enc/...
    └── results/
        └── final/
            └── figures/
                ├── cytoplasm_verificationFOs_barchart_source_data.csv
                ├── cytoplasm_verificationFOs_barchart.png
                ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
                ├── formation_verificationFOs_0.83thresh_barchart.png
                ├── nucleus_verificationFOs_barchart_source_data.csv
                ├── nucleus_verificationFOs_barchart.png
            ├── cytoplasm_verificationFOs_results.csv
            ├── formation_verificationFOs_0.83thresh_results.csv
            ├── nucleus_verificationFOs_results.csv
```

The following files are in `results/final/figures`:
- **`cytoplasm_verificationFOs_barchart.png`**: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
- **`formation_verificationFOs_0.83thresh_barchart.png`**: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
- **`nucleus_verificationFOs_barchart.png`**: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)

The raw data are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.
If you train a new model, the equivalents of these files will be created in `results/timestamp` for the configurations you set in `config.py`.

To run training, enter in terminal:
```
python train.py
```

To regenerate plots, run:
```
python plot.py
```
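
For orientation, the files described under "From raw data to train/test splits" can be combined into per-task training and test sets roughly as follows. This is a minimal sketch, not the logic in `train.py`. It assumes that the `split` column holds the literal values `train` and `test` and that the embedding dictionary is keyed by sequence. The column name `aa_seq` and the function `build_task_arrays` are hypothetical, and the toy rows are invented placeholders rather than FOdb data.

```python
import csv
import io


def build_task_arrays(split_rows, embeddings, task, seq_col="aa_seq"):
    """Group feature vectors and 0/1 labels by split for one task.

    split_rows: iterable of dicts, e.g. from csv.DictReader over splits.csv
    embeddings: dict mapping sequence -> feature vector
    task: one of "nucleus", "cytoplasm", "formation"
    seq_col: hypothetical name of the sequence column (an assumption)
    """
    data = {"train": ([], []), "test": ([], [])}
    for row in split_rows:
        features, labels = data[row["split"]]
        features.append(embeddings[row[seq_col]])
        labels.append(int(row[task]))
    return data


# Toy stand-ins for splits.csv and the pickled dictionary (made-up values).
toy_csv = io.StringIO(
    "aa_seq,split,nucleus,cytoplasm,formation\n"
    "MKTAYI,train,1,0,1\n"
    "GSHMLE,train,0,0,0\n"
    "MSTNPK,test,0,1,1\n"
)
toy_embeddings = {"MKTAYI": [0.1, 0.2], "GSHMLE": [0.3, 0.4], "MSTNPK": [0.5, 0.6]}

arrays = build_task_arrays(list(csv.DictReader(toy_csv)), toy_embeddings, "formation")
X_train, y_train = arrays["train"]
X_test, y_test = arrays["test"]
print(y_train, y_test)  # [1, 0] [1]
```

With the real files, `split_rows` would come from `csv.DictReader` over `splits.csv` and `embeddings` from `pickle.load` on `FOdb_physicochemical_embeddings.pkl`. The resulting lists could then be passed to a classifier such as `xgboost.XGBClassifier`.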