IDR Property Prediction Benchmark
This folder contains all the data and code needed to perform the IDR property prediction benchmark, where FusOn-pLM-IDR (a regressor built on FusOn-pLM embeddings) is used to predict aggregate properties of intrinsically disordered regions (IDRs), specifically asphericity, end-to-end distance (Re), radius of gyration (Rg), and polymer scaling exponent (Figure 4A-B).
TL;DR
The order in which to run the scripts, after downloading data:
python clean.py # clean the data
python cluster.py # MMSeqs2 clustering
python split.py # make cluster-based train/val/test splits
python train.py # train the model
python plot.py # if you want to remake r2 plots
Downloading raw IDR data
IDR properties from Lotthammer et al. 2024 (ALBATROSS model) were used to train FusOn-pLM-IDR. Sequences were downloaded from this link and deposited in raw_data. All files in raw_data are from this direct download.
benchmarking/
└── idr_prediction/
    └── raw_data/
        ├── asph_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── asph_nat_meth_test.tsv
        ├── scaled_re_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── scaled_re_nat_meth_test.tsv
        ├── scaled_rg_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── scaled_rg_nat_meth_test.tsv
        ├── scaling_exp_bio_synth_training_data_cleaned_05_09_2023.tsv
        └── scaling_exp_nat_meth_test.tsv
- asph=asphericity, scaled_re=scaled Re, scaled_rg=scaled Rg, scaling_exp=polymer scaling exponent
- <property>_bio_synth_training_data_cleaned_05_09_2023.tsv are ALBATROSS training data for the four properties, downloaded directly from their GitHub
- <property>_nat_meth_test.tsv are ALBATROSS testing data for the four properties, downloaded directly from their GitHub
Cleaning raw IDR data
clean.py cleans the raw training and testing data separately for each property. Any sequence that appears in both train and test is removed from train and kept in test. Finally, the data for the four properties are combined into one file:
benchmarking/
└── idr_prediction/
    └── processed_data/
        └── all_albatross_seqs_and_properties.csv
all_albatross_seqs_and_properties.csv: Columns = "Sequence","IDs","UniProt_IDs","UniProt_Names","Split","asph","scaled_re","scaled_rg","scaling_exp". Every value in "Split" is either "Train" or "Test", indicating how the ALBATROSS model used that sequence.
To perform cleaning, run
python clean.py
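The duplicate-handling rule described above can be sketched as follows. This is a minimal illustration and not the code in clean.py: the function name `drop_train_duplicates` and the toy rows are hypothetical, and only the "Sequence" column name comes from the output file described above.

```python
def drop_train_duplicates(train_rows, test_rows):
    """Drop every train row whose sequence also appears in test (test keeps it)."""
    test_seqs = {row["Sequence"] for row in test_rows}
    return [row for row in train_rows if row["Sequence"] not in test_seqs]


# Toy data (hypothetical values) for one property
train = [
    {"Sequence": "MKTAA", "asph": 0.41},
    {"Sequence": "GSGSG", "asph": 0.22},
    {"Sequence": "PPGGP", "asph": 0.35},
]
test = [{"Sequence": "GSGSG", "asph": 0.22}]

cleaned_train = drop_train_duplicates(train, test)
print([row["Sequence"] for row in cleaned_train])
```

The shared sequence GSGSG is dropped from train only, so the test set is left untouched.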
Using config.py for clustering, splitting, training
This file holds the configurations for clustering, splitting, and training.
# Clustering Parameters
CLUSTER = CustomParams(
# MMSeqs2 parameters: see GitHub or MMSeqs2 Wiki for guidance
MIN_SEQ_ID = 0.3, # % identity
C = 0.5, # % sequence length overlap
COV_MODE = 1, # cov-mode: 0 = bidirectional, 1 = target coverage, 2 = query coverage, 3 = target-in-query length coverage.
CLUSTER_MODE = 2,
# File paths
INPUT_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
PATH_TO_MMSEQS = '../../mmseqs' # path to where you installed MMSeqs2
)
# Split config
SPLIT = CustomParams(
IDR_DB_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
CLUSTER_OUTPUT_PATH = 'clustering/mmseqs_full_results.csv',
RANDOM_STATE_1 = 2, # random_state_1 = state for splitting all data into train & other
TEST_SIZE_1 = 0.21, # test size for data -> train/other split. e.g. 0.20 means 80% clusters in train, 20% clusters in other
RANDOM_STATE_2 = 6, # random_state_2 = state for splitting other from ^ into val and test
TEST_SIZE_2 = 0.50 # test size for other -> val/test split. e.g. 0.50 means 50% clusters in val, 50% clusters in test
)
# Which models to benchmark
TRAIN = CustomParams(
BENCHMARK_FUSONPLM = True,
FUSONPLM_CKPTS= "FusOn-pLM", # Dictionary: key = run name, values = epochs, or string "FusOn-pLM"
BENCHMARK_ESM = True,
# GPU configs
CUDA_VISIBLE_DEVICES="0",
# Overwriting configs
PERMISSION_TO_OVERWRITE_EMBEDDINGS = False, # if False, script will halt if it believes these embeddings have already been made.
PERMISSION_TO_OVERWRITE_MODELS = False # if False, script will halt if it believes these models have already been trained.
)
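The two overwrite flags act as a safety check before anything is written to disk. The sketch below shows one way such a guard could look. It is an assumption, not the logic in train.py: the helper `check_overwrite` is hypothetical, and the real script may detect existing embeddings or models differently.

```python
import os


def check_overwrite(path, permission_to_overwrite):
    """Raise if `path` already exists and overwriting has not been permitted."""
    if os.path.exists(path) and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set the matching PERMISSION_TO_OVERWRITE_* "
            "flag to True in config.py to replace it."
        )
    return path
```

For example, `check_overwrite("embeddings/fuson_plm", False)` would stop the run if that folder were already present.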
Clustering
Clustering of all sequences in all_albatross_seqs_and_properties.csv is performed by cluster.py.
The clustering command entered by the script is:
mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.5 --cov-mode 1 --cluster-mode 2 --dbtype 1
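To make the link between config.py and this command explicit, here is a small sketch that assembles the same argument list from the CLUSTER values. The function `build_easy_cluster_cmd` is hypothetical and cluster.py may build the command differently; the flags and paths are copied from the command above.

```python
import shlex


def build_easy_cluster_cmd(mmseqs_path, input_fasta, output_prefix, tmp_dir,
                           min_seq_id, c, cov_mode, cluster_mode, dbtype=1):
    """Return the mmseqs easy-cluster call as an argument list."""
    return [
        mmseqs_path, "easy-cluster", input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(c),
        "--cov-mode", str(cov_mode),
        "--cluster-mode", str(cluster_mode),
        "--dbtype", str(dbtype),
    ]


cmd = build_easy_cluster_cmd(
    "mmseqs",                        # or the PATH_TO_MMSEQS value from config.py
    "clustering/input.fasta",
    "clustering/raw_output/mmseqs",
    "clustering/raw_output",
    min_seq_id=0.3, c=0.5, cov_mode=1, cluster_mode=2,
)
print(shlex.join(cmd))
# To actually run MMSeqs2: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems if any path contains spaces.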
The script will generate the following files:
benchmarking/
└── idr_prediction/
    └── clustering/
        ├── input.fasta
        └── mmseqs_full_results.csv
- clustering/input.fasta: the input file used by MMSeqs2 to cluster the IDR sequences. Headers are our assigned sequence IDs (found in the IDs column of processed_data/all_albatross_seqs_and_properties.csv).
- clustering/mmseqs_full_results.csv: clustering results. Columns:
  - representative seq_id: the seq_id of the sequence representing this cluster
  - member seq_id: the seq_id of a member of the cluster
  - representative seq: the amino acid sequence of the cluster representative (representative seq_id)
  - member seq: the amino acid sequence of the cluster member
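As a quick sanity check on the clustering output, the sketch below counts how many members each cluster has. It assumes only the column names listed above; the helper `cluster_sizes` and the in-memory example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def cluster_sizes(csv_file):
    """Count members per cluster, keyed by the representative's seq_id."""
    reader = csv.DictReader(csv_file)
    return Counter(row["representative seq_id"] for row in reader)


# Toy stand-in for clustering/mmseqs_full_results.csv (made-up rows)
example = io.StringIO(
    "representative seq_id,member seq_id,representative seq,member seq\n"
    "seq1,seq1,MKTAA,MKTAA\n"
    "seq1,seq2,MKTAA,MKTAG\n"
    "seq3,seq3,GSGSG,GSGSG\n"
)
sizes = cluster_sizes(example)
print(sizes.most_common())
```

On the real file, replace `example` with `open("clustering/mmseqs_full_results.csv", newline="")`.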
Splitting
Cluster-based splitting is performed by split.py. Results are formatted as follows:
benchmarking/
└── idr_prediction/
    └── splits/
        ├── asph/
        │   ├── test_df.csv
        │   ├── val_df.csv
        │   └── train_df.csv
        ├── scaled_re/... # same format as splits/asph
        ├── scaled_rg/... # same format as splits/asph
        ├── scaling_exp/... # same format as splits/asph
        ├── test_cluster_split.csv
        ├── train_cluster_split.csv
        └── val_cluster_split.csv
- <split>_cluster_split.csv: cluster information for the clusters in each split (train, val, test). Columns = "representative seq_id", "member seq_id", "representative seq", "member seq", "member length"
- asph/, scaled_re/, scaled_rg/, and scaling_exp/ contain the train, val, and test sets for each property (train_df.csv, val_df.csv, and test_df.csv). The splits follow <split>_cluster_split.csv, but not every property has a measurement for each of these sequences. The train-val-test ratio still remains 80-10-10 for each property, despite the sequence losses.
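The two-stage, cluster-level split configured in config.py can be sketched as below. This is an illustration under stated assumptions, not the code in split.py: it uses Python's `random` module rather than whatever splitter the script calls, so the same seeds will not reproduce the published splits, and `split_clusters` is a hypothetical name.

```python
import random


def split_clusters(rep_ids, test_size_1, random_state_1, test_size_2, random_state_2):
    """Split cluster representatives into train, val, and test sets."""
    reps = sorted(set(rep_ids))                     # deterministic starting order
    random.Random(random_state_1).shuffle(reps)     # stage 1: train vs. other
    n_other = round(len(reps) * test_size_1)
    other, train = reps[:n_other], reps[n_other:]
    random.Random(random_state_2).shuffle(other)    # stage 2: other -> val vs. test
    n_test = round(len(other) * test_size_2)
    test, val = other[:n_test], other[n_test:]
    return set(train), set(val), set(test)


# Toy clustering table of (representative seq_id, member seq_id) pairs
clusters = [(f"rep{i}", f"rep{i}") for i in range(100)] + [("rep0", "extra_member")]
train, val, test = split_clusters([rep for rep, _ in clusters], 0.21, 2, 0.50, 6)

# Every member inherits the split of its cluster representative,
# so no cluster is ever divided across train, val, and test.
member_split = {
    member: "train" if rep in train else "val" if rep in val else "test"
    for rep, member in clusters
}
print(len(train), len(val), len(test))
```

Splitting on representatives rather than on individual sequences is what keeps similar sequences out of different splits.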
Training
The model is defined in model.py and utils.py. The train.py script trains FusOn-pLM-IDR and ESM-2-650M-IDR models separately for each property (asphericity, Re, Rg, scaling exponent) with a hyperparameter screen, saves all results separated by property, and makes plots. plot.py can be used to regenerate the R2 plots.
- All results are stored in idr_prediction/results/<timestamp>, where timestamp is a unique string encoding the date and time when you started training.
- All raw outputs from models are stored in idr_prediction/trained_models/<embedding_path>, where embedding_path represents the embeddings used to build the IDR property predictor.
- All embeddings made for training will be stored in a new folder called idr_prediction/embeddings/ with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
Below is the FusOn-pLM-IDR raw outputs folder, trained_models/fuson_plm/best/, and the results from the paper, results/final/...
The outputs are structured as follows:
benchmarking/
└── idr_prediction/
    ├── results/final/
    │   ├── r2_plots
    │   │   ├── asph/
    │   │   │   ├── esm2_t33_650M_UR50D_asph_R2.png
    │   │   │   ├── esm2_t33_650M_UR50D_asph_R2_source_data.csv
    │   │   │   ├── fuson_plm_asph_R2.png
    │   │   │   └── fuson_plm_asph_R2_source_data.csv
    │   │   ├── scaled_re/ # same format as r2_plots/asph/...
    │   │   ├── scaled_rg/ # same format as r2_plots/asph/...
    │   │   └── scaling_exp/ # same format as r2_plots/asph/...
    │   ├── asph_best_test_r2.csv
    │   ├── asph_hyperparam_screen_test_r2.csv
    │   ├── scaled_re_best_test_r2.csv
    │   ├── scaled_re_hyperparam_screen_test_r2.csv
    │   ├── scaled_rg_best_test_r2.csv
    │   ├── scaled_rg_hyperparam_screen_test_r2.csv
    │   ├── scaling_exp_best_test_r2.csv
    │   └── scaling_exp_hyperparam_screen_test_r2.csv
    └── trained_models/
        ├── asph/
        │   ├── fuson_plm/best/
        │   │   ├── lr0.0001_bs32/
        │   │   │   ├── asph_r2.csv
        │   │   │   ├── train_val_losses.csv
        │   │   │   ├── test_loss.csv
        │   │   │   └── asph_test_predictions.csv
        │   │   └── ... other hyperparameter folders with same format as lr0.0001_bs32/
        │   └── esm2_t33_650M_UR50D # same format as asph/fuson_plm/best/
        ├── scaled_re/ # same format as asph/
        ├── scaled_rg/ # same format as asph/
        └── scaling_exp/ # same format as asph/
In both directories, results are organized by IDR property and by the type of embedding used to train FusOn-pLM-IDR.
In the results/final directory:
- r2_plots/<property>/: holds all R2 plots and source data (the formatted data used to make the R2 plots) for these properties.
- <property>_best_test_r2.csv: holds the R2 values for the top-performing models of each embedding type (e.g. ESM-2-650M and a specific checkpoint of FusOn-pLM)
- <property>_hyperparam_screen_test_r2.csv: holds the R2 values for all embedding types, for all screened hyperparameters
In the trained_models directory:
- <property>/: holds all results for all trained models predicting this property
- asph/fuson_plm/best/: holds all FusOn-pLM-IDR results on asphericity prediction for each set of hyperparameters screened when embeddings are made from "fuson_plm/best" (FusOn-pLM model). For example, lr0.0001_bs32/ holds results for learning rate of 0.0001, batch size 32. If you were to retrain your own checkpoint of fuson_plm and run the IDR prediction benchmark, its results would be stored in a new subfolder of trained_models/fuson_plm.
- asph/fuson_plm/best/lr0.0001_bs32/asph_r2.csv: R2 value for this set of hyperparameters with "fuson_plm/best" embeddings
- asph/fuson_plm/best/lr0.0001_bs32/asph_test_predictions.csv: true asphericity values of the test set proteins, alongside FusOn-pLM-IDR's predictions of them.
- asph/fuson_plm/best/lr0.0001_bs32/test_loss.csv: FusOn-pLM-IDR's asphericity test loss value
- asph/fuson_plm/best/lr0.0001_bs32/train_val_losses.csv: FusOn-pLM-IDR's training and validation loss over each epoch while training on asphericity data
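For readers who want to recompute an R2 value from a predictions file, the sketch below implements the standard coefficient of determination in plain Python. It is an assumption that the reported R2 follows this definition (1 minus residual sum of squares over total sum of squares); the function `r2` and the toy values are hypothetical, and the column names of asph_test_predictions.csv are not assumed here.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (made up): true asphericity and two sets of predictions
y_true = [0.10, 0.20, 0.30, 0.40]
perfect = list(y_true)                          # exact predictions
mean_only = [sum(y_true) / len(y_true)] * 4     # always predict the mean

print(r2(y_true, perfect), r2(y_true, mean_only))
```

A model that always predicts the mean scores 0 under this definition, and a perfect model scores 1, which gives a useful reference when reading the R2 plots.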
To run the training script, enter:
nohup python train.py > train.out 2> train.err &
To run the plotting script, enter:
python plot.py