Sophia Vincoff

	## Puncta Prediction Benchmark

	This folder contains all the data and code needed to train FusOn-pLM-Puncta models and perform the puncta prediction benchmark (Figure 3).

	### From raw data to train/test splits
	To train the puncta predictors, we processed raw data from FOdb [(Tripathi et al. 2023)](https://doi.org/10.1038/s41467-023-41655-2), specifically Supplementary dataset 4 (`fuson_plm/data/raw_data/FOdb_puncta.csv`) and Supplementary dataset 5 (`fuson_plm/data/raw_data/FOdb_SD5.csv`), using `clean.py` in the `puncta` directory.

	```
	data/
	└── raw_data/
	    ├── FOdb_puncta.csv
	    ├── FOdb_SD5.csv

	benchmarking/
	└── puncta/
	    ├── clean.py
	    ├── cleaned_dataset_s4.csv
	    ├── splits.csv
	    ├── FOdb_physicochemical_embeddings.pkl
	```

	The `clean.py` script generates the following files:
	- `cleaned_dataset_s4.csv`: cleaned version of `FOdb_puncta.csv`, where fusion oncoproteins with puncta status "Other" or "Nucleolar" have been removed, and only the 25 low-MI features from `FOdb_SD5.csv` are retained.
	- `splits.csv`: fusion oncoproteins from `cleaned_dataset_s4.csv`, labeled in the `split` column as either being part of the train set ("Expressed_Set" in FOdb) or test set ("Verification_Set" in FOdb). This dataset also features `nucleus`, `cytoplasm`, and `formation` columns of 1s and 0s. In `nucleus`, 1=forms a condensate in the nucleus, 0=does not; in `cytoplasm`, 1=forms a condensate in the cytoplasm, 0=does not; in `formation`, 1=forms a condensate at all, 0=does not.
	- `FOdb_physicochemical_embeddings.pkl`: a dictionary where fusion proteins from `splits.csv` are the keys, and their feature vectors of 25 low-MI features from `cleaned_dataset_s4.csv` are the values.
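
As a quick sanity check on these outputs, the sketch below validates the structure described above using only the standard library. It runs on toy in-memory data rather than the real files. The `split`, `nucleus`, `cytoplasm`, and `formation` column names come from the description above; the `fusion_id` key column, the `train`/`test` split values, and the `summarize_splits` helper are assumptions made for illustration and may not match the actual files.

```python
import csv
import io
import pickle

LABEL_COLUMNS = ("nucleus", "cytoplasm", "formation")  # named in this README
N_FEATURES = 25  # number of low-MI features per fusion oncoprotein


def summarize_splits(rows, embeddings, key_col="fusion_id"):
    """Check labels and feature vectors, then count rows per split.

    `key_col` is a hypothetical column name; adjust it to whatever column
    identifies each fusion oncoprotein in your copy of splits.csv.
    """
    counts = {}
    for row in rows:
        for col in LABEL_COLUMNS:
            # csv.DictReader yields strings, so labels arrive as "0" or "1".
            if row[col] not in ("0", "1"):
                raise ValueError(f"{col} must be 0 or 1, found {row[col]!r}")
        vector = embeddings[row[key_col]]  # raises KeyError if a key is missing
        if len(vector) != N_FEATURES:
            raise ValueError(f"{row[key_col]} has {len(vector)} features")
        counts[row["split"]] = counts.get(row["split"], 0) + 1
    return counts


# Toy stand-in for splits.csv (values are illustrative, not real data).
toy_csv = io.StringIO(
    "fusion_id,split,nucleus,cytoplasm,formation\n"
    "FO_A,train,1,0,1\n"
    "FO_B,train,0,0,0\n"
    "FO_C,test,0,1,1\n"
)
toy_rows = list(csv.DictReader(toy_csv))

# Toy stand-in for the pickled dictionary, round-tripped through pickle.
toy_embeddings = pickle.loads(
    pickle.dumps({name: [0.0] * N_FEATURES for name in ("FO_A", "FO_B", "FO_C")})
)

split_counts = summarize_splits(toy_rows, toy_embeddings)
print(split_counts)
```

To run the same check on the real files, replace the toy inputs with `csv.DictReader(open("splits.csv"))` and `pickle.load(open("FOdb_physicochemical_embeddings.pkl", "rb"))`, after confirming the actual column names.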

	### Training

	`config.py` holds training configurations.

	```
	# Benchmarking configs
	BENCHMARK_FUSONPLM = True # True if you want to benchmark a FusOn-pLM Model

	# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
	# If you want to use the trained FusOn-pLM, instead set FUSONPLM_CKPTS="FusOn-pLM"
	FUSONPLM_CKPTS = {}

	# Model comparison configs
	BENCHMARK_ESM = True # True if you want to benchmark ESM-2-650M
	BENCHMARK_PROTT5 = True # True if you want to benchmark ProtT5
	BENCHMARK_FO_PUNCTA_ML = True # True if you want to benchmark FO-Puncta-ML from the FOdb paper

	# Overwriting configs
	PERMISSION_TO_OVERWRITE = False # if False, script will halt if it believes these embeddings have already been made.

	# GPU configs
	CUDA_VISIBLE_DEVICES="0" # GPUs to make visible for this process
	```
	<br>
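
The two accepted forms of `FUSONPLM_CKPTS` can be illustrated with a small helper that flattens either form into a list of checkpoints to benchmark. This is a sketch, not the logic in `train.py`: the `expand_checkpoints` function is hypothetical, and it assumes each dictionary value is a list of epoch numbers.

```python
def expand_checkpoints(ckpts):
    """Turn a FUSONPLM_CKPTS value into a list of (run_name, epoch) pairs.

    Assumption: a string names the released model (no epoch needed), while a
    dictionary maps each run name to a list of epochs to benchmark.
    """
    if isinstance(ckpts, str):
        return [(ckpts, None)]
    pairs = []
    for run_name, epochs in ckpts.items():
        for epoch in epochs:
            pairs.append((run_name, epoch))
    return pairs


# The released model, as described in the config comments.
print(expand_checkpoints("FusOn-pLM"))

# A hypothetical custom run benchmarked at two epochs.
print(expand_checkpoints({"my_run": [10, 20]}))
```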

	`train.py` will train the XGBoost classifiers.
	- All results are stored in `puncta/results/timestamp`, where `timestamp` is a unique string encoding the date and time when you started training.
	- All embeddings made for training will be stored in a new folder called `puncta/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.

	```
	benchmarking/
	└── puncta/
	    ├── embeddings/
	    │   ├── esm2_t33_650M_UR50D/...
	    │   ├── fuson_plm/...
	    │   └── prot_t5_xl_half_uniref50_enc/...
	    └── results/
	        └── final/
	            ├── figures/
	            │   ├── cytoplasm_verificationFOs_barchart_source_data.csv
	            │   ├── cytoplasm_verificationFOs_barchart.png
	            │   ├── formation_verificationFOs_0.83thresh_barchart_source_data.csv
	            │   ├── formation_verificationFOs_0.83thresh_barchart.png
	            │   ├── nucleus_verificationFOs_barchart_source_data.csv
	            │   └── nucleus_verificationFOs_barchart.png
	            ├── cytoplasm_verificationFOs_results.csv
	            ├── formation_verificationFOs_0.83thresh_results.csv
	            └── nucleus_verificationFOs_results.csv
	```
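
The timestamped results folder and the per-model embedding cache described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `train.py`: the timestamp format, the `embeddings.pkl` file name, and both helper functions are hypothetical.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def make_results_dir(root, now=None):
    """Create results/<timestamp> under root (timestamp format is assumed)."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    out = Path(root) / "results" / stamp
    out.mkdir(parents=True, exist_ok=False)
    return out


def load_or_make_embeddings(root, model_name, make_fn):
    """Reuse cached embeddings for model_name, or build and cache them."""
    path = Path(root) / "embeddings" / model_name / "embeddings.pkl"
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    embeddings = make_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(embeddings, handle)
    return embeddings


calls = []


def fake_embedder():
    # Stand-in for an expensive embedding step; records each invocation.
    calls.append(1)
    return {"FO_A": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_results_dir(tmp, datetime(2024, 1, 2, 3, 4, 5))
    run_dir_name = run_dir.name
    first = load_or_make_embeddings(tmp, "esm2_t33_650M_UR50D", fake_embedder)
    second = load_or_make_embeddings(tmp, "esm2_t33_650M_UR50D", fake_embedder)

print(run_dir_name, len(calls))
```

The second call reads the cached file instead of invoking the embedder again, which is the behavior the `embeddings/` folder is meant to provide.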

	The following files are in `results/final/figures`:
	- `cytoplasm_verificationFOs_barchart.png`: bar chart of performance on the cytoplasm puncta prediction task (Fig. 3E), and the formatted data that went directly into the plot (`cytoplasm_verificationFOs_barchart_source_data.csv`)
	- `formation_verificationFOs_0.83thresh_barchart.png`: bar chart of performance on the puncta formation prediction task (Fig. 3C), and the formatted data that went directly into the plot (`formation_verificationFOs_0.83thresh_barchart_source_data.csv`)
	- `nucleus_verificationFOs_barchart.png`: bar chart of performance on the nucleus puncta prediction task (Fig. 3D), and the formatted data that went directly into the plot (`nucleus_verificationFOs_barchart_source_data.csv`)

	The raw data are included in `results/final` as `cytoplasm_verificationFOs_results.csv`, `formation_verificationFOs_0.83thresh_results.csv`, and `nucleus_verificationFOs_results.csv`.

	If you train a new model, the equivalents of these files will be created in `results/timestamp` for your specific configurations set in `config.py`.

	To run training, enter the following in a terminal:
	```
	python train.py
	```

	To regenerate plots, run:
	```
	python plot.py
	```