Sophia Vincoff

IDR Property Prediction Benchmark

This folder contains all the data and code needed to run the IDR property prediction benchmark, in which FusOn-pLM-IDR (a regressor built on FusOn-pLM embeddings) predicts aggregate properties of intrinsically disordered regions (IDRs): asphericity, end-to-end distance (Re), radius of gyration (Rg), and polymer scaling exponent (Figure 4A-B).

TL;DR

The order in which to run the scripts, after downloading data:

python clean.py             # clean the data
python cluster.py           # MMSeqs2 clustering
python split.py             # make cluster-based train/val/test splits
python train.py             # train the model
python plot.py              # if you want to remake r2 plots

Downloading raw IDR data

IDR properties from Lotthammer et al. 2024 (ALBATROSS model) were used to train FusOn-pLM-IDR. Sequences were downloaded from this link and deposited in raw_data. All files in raw_data are from this direct download.

benchmarking/
└── idr_prediction/ 
    └── raw_data/
        ├── asph_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── asph_nat_meth_test.tsv
        ├── scaled_re_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── scaled_re_nat_meth_test.tsv
        ├── scaled_rg_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── scaled_rg_nat_meth_test.tsv
        ├── scaling_exp_bio_synth_training_data_cleaned_05_09_2023.tsv
        ├── scaling_exp_nat_meth_test.tsv
  • asph=asphericity, scaled_re=scaled Re, scaled_rg=scaled Rg, scaling_exp=polymer scaling exponent
  • <property>_bio_synth_training_data_cleaned_05_09_2023.tsv are the ALBATROSS training data for the four properties, downloaded directly from their GitHub
  • <property>_nat_meth_test.tsv are the ALBATROSS testing data for the four properties, downloaded directly from their GitHub

Cleaning raw IDR data

clean.py cleans the raw training and testing data separately for each property. Any sequence that appears in both the training and testing data is removed from train and kept in test. Finally, the data for the four properties are combined into one file:

benchmarking/
└── idr_prediction/ 
    └── processed_data/
        ├── all_albatross_seqs_and_properties.csv
  • all_albatross_seqs_and_properties.csv: Columns = "Sequence","IDs","UniProt_IDs","UniProt_Names","Split","asph","scaled_re","scaled_rg","scaling_exp". Every value in "Split" is either "Train" or "Test", indicating how the ALBATROSS model used that sequence

To perform cleaning, run

python clean.py
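
The de-duplication rule described above (overlapping sequences are dropped from train and kept in test) can be sketched as follows. This is a minimal illustration and not the code in clean.py: the function name and the toy records are hypothetical, and only the "Sequence" column name is taken from the processed file.

```python
def drop_train_test_overlap(train_rows, test_rows):
    """Drop training rows whose sequence also appears in the test rows."""
    test_sequences = {row["Sequence"] for row in test_rows}
    cleaned_train = [
        row for row in train_rows if row["Sequence"] not in test_sequences
    ]
    # Test rows are returned unchanged: overlapping sequences stay in test.
    return cleaned_train, list(test_rows)


# Toy records; the sequences and values are made up for illustration.
train = [
    {"Sequence": "MKTAYIAK", "asph": 0.41},
    {"Sequence": "GSGSGSGS", "asph": 0.37},
]
test = [{"Sequence": "GSGSGSGS", "asph": 0.39}]

cleaned_train, cleaned_test = drop_train_test_overlap(train, test)
print([row["Sequence"] for row in cleaned_train])  # ['MKTAYIAK']
```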

Using config.py for clustering, splitting, training

config.py holds the configurations for clustering, splitting, and training.

# Clustering Parameters
CLUSTER = CustomParams(
    # MMSeqs2 parameters: see GitHub or MMSeqs2 Wiki for guidance
    MIN_SEQ_ID = 0.3,                                                   # % identity
    C = 0.5,                                                            # % sequence length overlap
    COV_MODE = 1,                                                       # cov-mode: 0 = bidirectional, 1 = target coverage, 2 = query coverage, 3 = target-in-query length coverage.
    CLUSTER_MODE = 2,
    # File paths
    INPUT_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
    PATH_TO_MMSEQS = '../../mmseqs'                                     # path to where you installed MMSeqs2   
)

# Split config
SPLIT = CustomParams(
    IDR_DB_PATH = 'processed_data/all_albatross_seqs_and_properties.csv',
    CLUSTER_OUTPUT_PATH = 'clustering/mmseqs_full_results.csv',    
    RANDOM_STATE_1 = 2,                                     # random_state_1 = state for splitting all data into train & other
    TEST_SIZE_1 = 0.21,                                     # test size for data -> train/other split. e.g. 0.20 means 80% of clusters in train, 20% of clusters in other
    RANDOM_STATE_2 = 6,                                     # random_state_2 = state for splitting other from ^ into val and test
    TEST_SIZE_2 = 0.50                                      # test size for other -> val/test split. e.g. 0.50 means 50% of clusters in val, 50% of clusters in test

)

# Which models to benchmark
TRAIN = CustomParams(
    BENCHMARK_FUSONPLM = True,
    FUSONPLM_CKPTS= "FusOn-pLM",                            # Dictionary: key = run name, values = epochs, or string "FusOn-pLM"
    BENCHMARK_ESM = True,

    # GPU configs
    CUDA_VISIBLE_DEVICES="0",

    # Overwriting configs
    PERMISSION_TO_OVERWRITE_EMBEDDINGS = False,             # if False, script will halt if it believes these embeddings have already been made. 
    PERMISSION_TO_OVERWRITE_MODELS = False                  # if False, script will halt if it believes these models have already been trained.
)
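
The two overwrite flags describe a guard that stops the run when outputs already exist. A minimal sketch of such a guard is shown below, assuming that "already been made" is detected by checking whether the output path exists. The function name and the file name are hypothetical, and train.py may detect existing outputs differently.

```python
import os
import tempfile


def ensure_can_write(path, permission_to_overwrite):
    """Raise FileExistsError if `path` exists and overwriting is not allowed."""
    if os.path.exists(path) and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set the overwrite permission to True "
            "to regenerate it."
        )
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "embeddings.pkl")  # hypothetical file name
    open(existing, "w").close()

    # With permission set to False, the guard halts on the existing file.
    try:
        ensure_can_write(existing, permission_to_overwrite=False)
        halted = False
    except FileExistsError:
        halted = True

    # With permission set to True, the same path is accepted.
    allowed = ensure_can_write(existing, permission_to_overwrite=True)
```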

Clustering

Clustering of all sequences in all_albatross_seqs_and_properties.csv is performed by cluster.py.

The clustering command entered by the script is:

mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.5 --cov-mode 1 --cluster-mode 2 --dbtype 1
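
For reference, the command above can be assembled from the CLUSTER values in config.py roughly as follows. This is a sketch only: CustomParams is not reproduced here, so a SimpleNamespace stands in for it, the function name is hypothetical, and the executable is written as plain mmseqs rather than being resolved from PATH_TO_MMSEQS.

```python
import shlex
from types import SimpleNamespace

# Stand-in for the CLUSTER object in config.py (assumption: attribute access).
CLUSTER = SimpleNamespace(MIN_SEQ_ID=0.3, C=0.5, COV_MODE=1, CLUSTER_MODE=2)


def build_easy_cluster_cmd(params, fasta, out_prefix, tmp_dir, mmseqs="mmseqs"):
    """Return the `mmseqs easy-cluster` call as an argument list."""
    return [
        mmseqs, "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(params.MIN_SEQ_ID),
        "-c", str(params.C),
        "--cov-mode", str(params.COV_MODE),
        "--cluster-mode", str(params.CLUSTER_MODE),
        "--dbtype", "1",
    ]


cmd = build_easy_cluster_cmd(
    CLUSTER,
    fasta="clustering/input.fasta",
    out_prefix="clustering/raw_output/mmseqs",
    tmp_dir="clustering/raw_output",
)
print(shlex.join(cmd))
```

An argument list like this can be passed to subprocess.run(cmd, check=True), which avoids shell quoting issues.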

The script will generate the following files:

benchmarking/
└── idr_prediction/ 
    └── clustering/
        ├── input.fasta
        ├── mmseqs_full_results.csv
  • clustering/input.fasta: the input file used by MMSeqs2 to cluster the IDR sequences. Headers are our assigned sequence IDs (found in the IDs column of processed_data/all_albatross_seqs_and_properties.csv).
  • clustering/mmseqs_full_results.csv: clustering results. Columns:
    • representative seq_id: the seq_id of the sequence representing this cluster
    • member seq_id: the seq_id of a member of the cluster
    • representative seq: the amino acid sequence of the cluster representative (representative seq_id)
    • member seq: the amino acid sequence of the cluster member
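
MMSeqs2 easy-cluster writes a two-column, tab-separated file of (representative ID, member ID) pairs, named <prefix>_cluster.tsv. A table with the four columns listed above could be built from those pairs as sketched below. The function name and the toy IDs and sequences are hypothetical, and cluster.py may construct the file differently.

```python
import csv
import io


def build_cluster_table(cluster_tsv_text, id_to_seq):
    """Join (representative, member) ID pairs with their sequences."""
    rows = []
    for fields in csv.reader(io.StringIO(cluster_tsv_text), delimiter="\t"):
        if not fields:  # skip blank lines
            continue
        rep_id, member_id = fields
        rows.append({
            "representative seq_id": rep_id,
            "member seq_id": member_id,
            "representative seq": id_to_seq[rep_id],
            "member seq": id_to_seq[member_id],
        })
    return rows


# Toy data: seq1 represents a cluster of two, seq3 is a singleton.
id_to_seq = {"seq1": "MKTAYIAK", "seq2": "MKTAYIAR", "seq3": "GSGSGSGS"}
cluster_tsv = "seq1\tseq1\nseq1\tseq2\nseq3\tseq3\n"

table = build_cluster_table(cluster_tsv, id_to_seq)
for row in table:
    print(row["representative seq_id"], row["member seq_id"], row["member seq"])
```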

Splitting

Cluster-based splitting is performed by split.py. Results are formatted as follows:

benchmarking/
└── idr_prediction/ 
    └── splits/
        └── asph/
            ├── test_df.csv
            ├── val_df.csv
            ├── train_df.csv
        └── scaled_re/...   # same format as splits/asph
        └── scaled_rg/...   # same format as splits/asph
        └── scaling_exp/... # same format as splits/asph
        ├── test_cluster_split.csv
        ├── train_cluster_split.csv
        ├── val_cluster_split.csv
  • <split>_cluster_split.csv: cluster information for the clusters in each split (train, val, test). Columns = "representative seq_id", "member seq_id", "representative seq", "member seq", "member length"
  • 📁 asph/, scaled_re/, scaled_rg/, and scaling_exp/ contain the train, val, and test sets for each property (train_df.csv, val_df.csv, and test_df.csv). The splits follow <split>_cluster_split.csv, but not every property has a measurement for each of these sequences. The train-val-test ratio remains 80-10-10 for each property, despite the sequence losses.
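
The two-stage, cluster-level split configured in config.py (all clusters into train and other, then other into val and test) can be sketched as below. This is an illustration under stated assumptions, not the logic of split.py: it shuffles with Python's random module, so it will not reproduce the repository's actual splits even with the same seeds, and all function names are hypothetical.

```python
import random


def hold_out_clusters(cluster_ids, test_size, seed):
    """Shuffle cluster IDs and hold out a `test_size` fraction of them."""
    ids = sorted(set(cluster_ids))
    random.Random(seed).shuffle(ids)
    n_held_out = round(len(ids) * test_size)
    return ids[n_held_out:], ids[:n_held_out]  # (kept, held out)


def cluster_based_split(member_to_rep, test_size_1=0.21, seed_1=2,
                        test_size_2=0.50, seed_2=6):
    """Assign every member the split of its cluster representative."""
    train_reps, other_reps = hold_out_clusters(
        member_to_rep.values(), test_size_1, seed_1
    )
    val_reps, test_reps = hold_out_clusters(other_reps, test_size_2, seed_2)
    split_of_rep = {rep: "train" for rep in train_reps}
    split_of_rep.update({rep: "val" for rep in val_reps})
    split_of_rep.update({rep: "test" for rep in test_reps})
    return {member: split_of_rep[rep] for member, rep in member_to_rep.items()}


# Toy input: 100 clusters with two members each.
member_to_rep = {f"c{i}_m{j}": f"c{i}_m0" for i in range(100) for j in range(2)}
assignment = cluster_based_split(member_to_rep)
```

Because the split is drawn over cluster representatives, all members of a cluster land in the same split, which keeps similar sequences from appearing on both sides of train and test.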

Training

The model is defined in model.py and utils.py. The train.py script trains FusOn-pLM-IDR and ESM-2-650M-IDR models separately for each property (asphericity, Re, Rg, scaling exponent) with a hyperparameter screen, saves all results separated by property, and makes plots. plot.py can be used to regenerate the R2 plots.

  • All results are stored in idr_prediction/results/<timestamp>, where timestamp is a unique string encoding the date and time when you started training.
  • All raw outputs from models are stored in idr_prediction/trained_models/<property>/<embedding_path>, where property is the IDR property being predicted and embedding_path represents the embeddings used to build the predictor.
  • All embeddings made for training will be stored in a new folder called idr_prediction/embeddings/ with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.

This folder includes the FusOn-pLM-IDR raw outputs (trained_models/<property>/fuson_plm/best/) and the results from the paper (results/final/).

The outputs are structured as follows:

benchmarking/
└── idr_prediction/ 
    └── results/final/
        └── r2_plots
            └── asph/
                ├── esm2_t33_650M_UR50D_asph_R2.png
                ├── esm2_t33_650M_UR50D_asph_R2_source_data.csv
                ├── fuson_plm_asph_R2.png
                ├── fuson_plm_asph_R2_source_data.csv
            └── scaled_re/          # same format as r2_plots/asph/...
            └── scaled_rg/          # same format as r2_plots/asph/...
            └── scaling_exp/        # same format as r2_plots/asph/...
        ├── asph_best_test_r2.csv
        ├── asph_hyperparam_screen_test_r2.csv
        ├── scaled_re_best_test_r2.csv
        ├── scaled_re_hyperparam_screen_test_r2.csv
        ├── scaled_rg_best_test_r2.csv
        ├── scaled_rg_hyperparam_screen_test_r2.csv
        ├── scaling_exp_best_test_r2.csv
        ├── scaling_exp_hyperparam_screen_test_r2.csv
    └── trained_models/
        └── asph/
            └── fuson_plm/best/
                └── lr0.0001_bs32/
                    ├── asph_r2.csv
                    ├── train_val_losses.csv
                    ├── test_loss.csv
                    ├── asph_test_predictions.csv
                └── ... other hyperparameter folders with same format as lr0.0001_bs32/ 
            └── esm2_t33_650M_UR50D         # same format as asph/fuson_plm/best/
        └── scaled_re/              # same format as asph/
        └── scaled_rg/              # same format as asph/
        └── scaling_exp/            # same format as asph/
        

In both directories, results are organized by IDR property and by the type of embedding used to train FusOn-pLM-IDR.

In the 📁 results/final directory:

  • 📁 r2_plots/<property>/: holds all R2 plots and source data (the formatted data used to make the R2 plots) for this property.
  • <property>_best_test_r2.csv: holds the R2 values for the top-performing models of each embedding type (e.g. ESM-2-650M and a specific checkpoint of FusOn-pLM)
  • <property>_hyperparam_screen_test_r2.csv: holds the R2 values for all embedding types, for all screened hyperparameters

In the 📁 trained_models directory:

  • 📁 <property>/: holds all results for all trained models predicting this property
  • 📁 asph/fuson_plm/best/: holds all FusOn-pLM-IDR results on asphericity prediction for each set of hyperparameters screened when embeddings are made from "fuson_plm/best" (FusOn-pLM model). For example, 📁 lr0.0001_bs32/ holds results for a learning rate of 0.0001 and a batch size of 32. If you retrain your own checkpoint of fuson_plm and run the IDR prediction benchmark, its results will be stored in a new subfolder of trained_models/<property>/fuson_plm/.
  • asph/fuson_plm/best/lr0.0001_bs32/asph_r2.csv: R2 value for this set of hyperparameters with "fuson_plm/best" embeddings
  • asph/fuson_plm/best/lr0.0001_bs32/asph_test_predictions.csv: true asphericity values of the test set proteins, alongside FusOn-pLM-IDR's predictions of them.
  • asph/fuson_plm/best/lr0.0001_bs32/test_loss.csv: FusOn-pLM-IDR's asphericity test loss value
  • asph/fuson_plm/best/lr0.0001_bs32/train_val_losses.csv: FusOn-pLM-IDR's training and validation loss over each epoch while training on asphericity data
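
The R2 values reported in these files are coefficients of determination between true and predicted property values. A minimal sketch of that calculation from paired values, such as those in a test predictions file, is shown below. The function is written from the standard definition and is not taken from utils.py; the numbers are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # undefined when all true values are equal


# Made-up true and predicted values.
y_true = [1.0, 2.0, 3.0]
perfect = r2_score(y_true, [1.0, 2.0, 3.0])        # predictions match exactly
mean_only = r2_score(y_true, [2.0, 2.0, 2.0])      # always predicts the mean
print(perfect, mean_only)  # 1.0 0.0
```

A model that always predicts the mean of the true values scores 0, and a perfect model scores 1, which is the scale on which the hyperparameter screen is compared.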

To run the training script, enter:

nohup python train.py > train.out 2> train.err &

To run the plotting script, enter:

python plot.py