## CAID Benchmark
This folder contains all the data and code needed to perform the **CAID benchmark**, where FusOn-pLM-Diso (a classifier built on FusOn-pLM embeddings) is used to predict per-residue disorder propensities (Figure 4C-F) and plot disorder properties (Figure 1C-1D, S1).
### TL;DR
The order in which to run the scripts:
```
python scrape_fusionpdb.py # pull FusionPDB structures
python process_fusion_structures.py # process FusionPDB structures, and head/tail protein structures
python clean.py # clean disorder data and structure data. Assemble train/test/benchmark splits
python train.py # train models
python analyze_fusion_preds.py # make box chart and line plot of model performance on fusion proteins
python plot.py # plot AUROC of model performance, and additional figures based on disorder data
```
Additional notes:
* `color_disorder_residues.ipynb` is used to plot fusion structures with pLDDT or disorder prediction color overlays.
* We recommend using `nohup` to run longer scripts like `scrape_fusionpdb.py`, `process_fusion_structures.py`, `clean.py`, and `train.py`.
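The run order above can also be driven from a small wrapper. The following is a minimal sketch, not part of the repository: the `run_pipeline` helper and the `<script>.out` / `<script>.err` log names are assumptions chosen for illustration, and the scripts are assumed to sit in the current working directory.

```python
import subprocess
import sys

# Script order taken from the TL;DR list above.
PIPELINE = [
    "scrape_fusionpdb.py",
    "process_fusion_structures.py",
    "clean.py",
    "train.py",
    "analyze_fusion_preds.py",
    "plot.py",
]


def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script in order and stop at the first failure.

    Returns the list of commands that were (or, with dry_run, would be) run.
    The log file names are a hypothetical convention, not the repo's own.
    """
    commands = []
    for script in scripts:
        cmd = [sys.executable, script]
        commands.append(cmd)
        if dry_run:
            continue
        stem = script.rsplit(".", 1)[0]
        with open(f"{stem}.out", "w") as out, open(f"{stem}.err", "w") as err:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete inputs.
            subprocess.run(cmd, stdout=out, stderr=err, check=True)
    return commands
```

Calling `run_pipeline(dry_run=True)` only lists the commands, which is a cheap way to confirm the order before starting a long run under `nohup`.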
### Downloading raw disorder data
Per-residue disorder predictions were used to train and test FusOn-pLM-Diso.
1. **flDPnn** ([Hu et al. 2021](https://doi.org/10.1038/s41467-021-24773-7))
    1. At this [link](http://biomine.cs.vcu.edu/servers/flDPnn/?fbclid=IwZXh0bgNhZW0CMTEAAR0KO5CkNdkGC9e5O32S0QoG3BWOw6_egbnioXQNBSv3UC-m_b_dxh70Nnk_aem_z285WFCHdBLw3vOj7LL37A), scroll down to the bottom to find links to the [training](http://biomine.cs.vcu.edu/servers/flDPnn/data/flDPnn_Training_Annotation.txt) and [validation](http://biomine.cs.vcu.edu/servers/flDPnn/data/flDPnn_Validation_Annotation.txt) sets.
2. **IDP-CRF** ([Liu et al. 2018](https://doi.org/10.3390/ijms19092483))
    1. Download zipped data from [this link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6164615/bin/ijms-19-02483-s001.zip), remove header and footer, and save as a FASTA file.
3. **CAID2-Disorder-NOX** ([Del Conte et al. 2023](https://doi.org/10.1002/prot.26582))
    1. Go to [CAID Round 2 Results](https://caid.idpcentral.org/challenge/results?fbclid=IwZXh0bgNhZW0CMTEAAR12dKaA0KywcT71FnyXIrrNS91pwGREsLiq5c2RmfdYl7L0VdUNG7jYai8_aem_tW6Wm9_11ZuiI_GKzbNZjA). Scroll to "Here you can download the references used in the CAID-2 challenge" and you'll find the following links.
        1. [disorder_nox.fasta](https://caid.idpcentral.org/assets/sections/challenge/static/references/2/disorder_nox.fasta)
        2. [predictions](https://caid.idpcentral.org/assets/sections/challenge/static/predictions/2/predictions.zip) made by all CAID2 participants; AUROC curves can be reconstructed from these
Raw disorder data are stored in `caid/raw_data`
```
benchmarking/
└── caid/
    └── raw_data/
        ├── caid2_competition_results/...
        └── caid2_train_and_test_data/
            ├── CAID-2_Disorder_NOX_Testing_Sequences.fasta
            ├── flDPnn_Training_Dataset.txt
            ├── flDPnn_Validation_Annotation.txt
            └── IDP-CRF_Training_Dataset.txt
- **`raw_data/caid2_competition_results/`**: folder containing raw predictions from CAID2 competitors, downloaded directly from the CAID2 website. Models: AlphaFold-disorder, AlphaFold-rsa, DeepIDP-2L, disomine, DisoPred, DISOPRED3-diso, Dispredict3, ESpritz-D, flDPlr2, flDPnn, flDPnn2, flDPtr, IDP-Fusion, IUPred3.
- **`raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta`**: Disorder-NOX dataset (used as the test set in this benchmark)
- **`raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt`**: training set for flDPnn
- **`raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt`**: validation set for flDPnn
- **`raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt`**: training set for IDP-CRF
### Processing disorder data
```
benchmarking/
└── caid/
    ├── processed_data/
    │   ├── caid2_competition_results/...
    │   ├── CAID-2_Disorder_NOX_Processed.csv
    │   ├── flDPnn_Training_Dataset.csv
    │   ├── flDPnn_Validation_Dataset.csv
    │   └── IDP-CRF_Training_Dataset.csv
    └── splits/
        ├── splits.csv
        ├── train_df.csv
        ├── test_df.csv
        └── fusion_bench_df.csv
The **`clean.py`** script processes and combines the raw data files, generating the following files in `processed_data/`:
- **`caid2_competition_results/`**: a folder with table versions of all the files in `raw_data/caid2_competition_results/`
- **`CAID-2_Disorder_NOX_Processed.csv`**: a table of test data, made by parsing `raw_data/caid2_train_and_test_data/CAID-2_Disorder_NOX_Testing_Sequences.fasta`
- **`flDPnn_Training_Dataset.csv`**: a table of flDPnn's training data, made by parsing `raw_data/caid2_train_and_test_data/flDPnn_Training_Dataset.txt`
- **`flDPnn_Validation_Dataset.csv`**: a table of flDPnn's validation data, made by parsing `raw_data/caid2_train_and_test_data/flDPnn_Validation_Dataset.txt`
- **`IDP-CRF_Training_Dataset.csv`**: a table of IDP-CRF's training data, made by parsing `raw_data/caid2_train_and_test_data/IDP-CRF_Training_Dataset.txt`
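As a rough illustration of the parsing step, the sketch below reads annotated FASTA text into records. It assumes each entry spans three lines (a `>` header, the sequence, and a per-residue label string of `0`, `1`, or `-`), which is how the CAID reference files are commonly laid out; the actual logic in `clean.py` may differ, and `parse_annotated_fasta` is a hypothetical name.

```python
def parse_annotated_fasta(text):
    """Parse '>ID / sequence / labels' triplets into a list of dicts.

    Assumption: exactly three non-empty lines per entry. Blank lines
    between entries are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines), 3):
        chunk = lines[i:i + 3]
        if len(chunk) < 3 or not chunk[0].startswith(">"):
            raise ValueError(f"Malformed entry starting at non-empty line {i + 1}")
        header, sequence, labels = chunk
        if len(sequence) != len(labels):
            raise ValueError(f"Length mismatch for entry {header}")
        records.append({
            "id": header[1:].strip(),
            "sequence": sequence,
            "labels": labels,
        })
    return records
```

Each record can then be written out as one row of a CSV table, with the label string kept alongside its sequence.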
`clean.py` also generates **the final train-test splits and fusion oncoprotein benchmarking file used to train and evaluate the disorder predictors.** These are stored in `splits/`:
- **`splits.csv`**: sequences, IDs, split (either "Train", "Test", or "Fusion_Benchmark"), and per-residue disorder labels based on AlphaFold-pLDDT (1 (disordered) if pLDDT < 68.8, 0 (ordered) if pLDDT >= 68.8)
- **`train_df.csv`**: just the Train set portion of `splits.csv`
- **`test_df.csv`**: just the Test set portion of `splits.csv`
- **`fusion_bench_df.csv`**: just the Fusion_Benchmark portion of `splits.csv`. Includes 524 fusion oncoproteins from the FusOn-pLM test set whose structures were collected from FusionPDB (see "Downloading and Processing FusionPDB data").
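The pLDDT labelling rule stated for `splits.csv` is simple enough to express directly. This is a minimal sketch using the 68.8 cutoff from the description above; the function name is hypothetical and not taken from the repository.

```python
PLDDT_CUTOFF = 68.8  # cutoff quoted in the splits.csv description


def plddt_to_disorder_labels(plddts, cutoff=PLDDT_CUTOFF):
    """Return 1 (disordered) where pLDDT < cutoff, else 0 (ordered)."""
    return [1 if value < cutoff else 0 for value in plddts]
```

Note that a residue with pLDDT exactly equal to 68.8 is labelled ordered, matching the `>=` in the rule.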
### Downloading and Processing FusionPDB data
The structures of fusion oncoproteins from the FusionPDB database were used to evaluate FusOn-pLM-Diso's performance on fusion oncoproteins. These data were collected by running `scrape_fusionpdb.py`, followed by `process_fusion_structures.py`. Together, these scripts populate both the `raw_data` and `processed_data` folders.
Listed below are all the relevant files:
```
benchmarking/
└── caid/
    ├── raw_data/
    │   └── fusionpdb/
    │       ├── structures/...                    # created by scrape_fusionpdb.py (folder not included in repo)
    │       ├── head_tail_af2db_structures/...    # created by process_fusion_structures.py (folder not included in repo)
    │       ├── FusionPDB_level2_curated_09_05_2024.csv
    │       ├── FusionPDB_level2_fusion_structure_links.csv
    │       ├── FusionPDB_level3_curated_09_05_2024.csv
    │       ├── FusionPDB_level3_fusion_structure_links.csv
    │       ├── fusionpdb_structureless_ids.txt
    │       ├── hgene_tgene_uniprot_idmap_07_10_2024.txt
    │       ├── level2_head_tail_info.txt
    │       ├── level3_head_tail_info.txt
    │       └── not_in_afdb_idmap.txt
    └── processed_data/
        └── fusionpdb/
            ├── intermediates/
            │   ├── giant_level2-3_fusion_protein_head_tail_info.csv
            │   ├── giant_level2-3_fusion_protein_structure_links.csv
            │   ├── giant_level2-3_fusion_protein_structures_processed.csv
            │   ├── uniprotids_not_in_afdb.txt
            │   └── unmapped_parts.txt
            ├── fusion_heads_and_tails.csv
            ├── FusionPDB_level2-3_cleaned_FusionGID_info.csv
            ├── FusionPDB_level2-3_cleaned_structure_info.csv
            └── heads_tails_structural_data.csv
```
#### Pipeline
Here we describe what each script does and which files each script creates.
1. **`scrape_fusionpdb.py`**
    i. Scrapes metadata for FusionPDB Level 2 and Level 3
        a. Pulls the online tables for [Level 2](https://compbio.uth.edu/FusionPDB/gene_search_result_0.cgi?type=chooseLevel&chooseLevel=level2) and [Level 3](https://compbio.uth.edu/FusionPDB/gene_search_result_0.cgi?type=chooseLevel&chooseLevel=level3), saving results to `raw_data/FusionPDB_level2_curated_09_05_2024.csv` and `raw_data/FusionPDB_level3_curated_09_05_2024.csv` respectively.
    ii. Retrieves structure links
        a. Using the tables collected in step (i), visits the page for each fusion oncoprotein (FO) in FusionPDB Level 2 and 3, and downloads all AlphaFold2 structure links for each FO.
        b. Saves results directly to `raw_data/FusionPDB_level2_fusion_structure_links.csv` and `raw_data/FusionPDB_level3_fusion_structure_links.csv`, respectively.
    iii. Retrieves FO head gene and tail gene info
        a. Using the tables collected in step (i), visits the page for each fusion oncoprotein (FO) in FusionPDB Level 2 and 3 to download head/tail info. Collects HGID and TGID (GeneIDs for head and tail) and UniProt accessions for each.
        b. Saves results directly to `raw_data/level2_head_tail_info.txt` and `raw_data/level3_head_tail_info.txt`, respectively.
    iv. Combines Level 2 and 3 head/tail data
        a. Merges `raw_data/level2_head_tail_info.txt` and `raw_data/level3_head_tail_info.txt` into a dataframe.
        b. Saves result at `processed_data/fusionpdb/fusion_heads_and_tails.csv` (columns="FusionGID","HGID","TGID","HGUniProtAcc","TGUniProtAcc")
    v. Combines Level 2 and 3 structure link data
        a. Joins structure link data with metadata for each of levels 2 and 3, then combines the result.
        b. Saves result at `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv`
    vi. Combines structure link data and metadata (result of step (v)) with head and tail data (result of step (iv)), and resolves any missing head/tail UniProt IDs.
        a. Merges the data
        b. Checks how many rows have either missing or wrong UniProt accessions for the head or tail gene, and compiles the gene symbols for online querying in the UniProt ID Mapping tool (`processed_data/fusionpdb/intermediates/unmapped_parts.txt`)
        c. Reads the UniProt ID Mapping result. Combines this data with FusionPDB-scraped data by matching FusionPDB's HGID (GeneID for head) and TGID (GeneID for tail) with the GeneID returned by UniProt.
        d. For any FO where FusionPDB lacked a UniProt ID for the head/tail, this ID is filled in from the UniProt ID Mapping result.
        e. Saves result to `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv`. Columns: "FusionGID","FusionGene","Hgene","Tgene","URL","HGID","TGID","HGUniProtAcc","TGUniProtAcc","HGUniProtAcc_Source","TGUniProtAcc_Source", where the "_Source" columns indicate whether the UniProt ID came from FusionPDB, or from the ID Map.
    vii. Downloads AlphaFold2 structures of FOs from FusionPDB.
        a. Using structure links from `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structure_links.csv` (step (v)), directly downloads `.pdb` and `.cif` files.
        b. Saves results in `raw_data/fusionpdb/structures`
2. **`process_fusion_structures.py`**
    i. Determines pLDDT(s) for each FO structure.
        a. For each structure in `raw_data/fusionpdb/structures/`, determines amino acid sequence, per-residue pLDDT, and average pLDDT from the AlphaFold2 structure.
        b. Saves results in `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_structures_processed.csv`.
    ii. Downloads AlphaFold2 structures for all head and tail proteins
        a. Reads `processed_data/fusionpdb/intermediates/giant_level2-3_fusion_protein_head_tail_info.csv` and collects all unique UniProt IDs for all head/tail proteins.
        b. For each UniProt ID, queries the AlphaFoldDB, downloads the AlphaFold2 structure (if available), and saves it to `raw_data/fusionpdb/head_tail_af2db_structures/`. Saves files converted from PDB to CIF format in `mmcif_converted_files`. Then, extracts the sequence, per-residue pLDDT, and average pLDDT from the file.
        c. Saves any UniProt IDs that did not have structures in the AlphaFoldDB to: `processed_data/fusionpdb/intermediates/uniprotids_not_in_afdb.txt`. Most of these were very long, but the shorter ones were folded and their average pLDDTs were manually inputted. These were put back into the AlphaFold ID map to look for alternative UniProt IDs, and their results are in `not_in_afdb_idmap.txt`.
        d. Saves results to `processed_data/fusionpdb/heads_tails_structural_data.csv`
    iii. Cleans the database of level 2&3 structural info
        a. Drops rows where no structure was successfully downloaded
        b. Drops rows where the FO sequence from FusionPDB does not match the FO sequence from its own AlphaFold2 structure file
        c. Saves **two final, cleaned databases**:
            a. **`FusionPDB_level2-3_cleaned_FusionGID_info.csv`**: includes full IDs and structural information for the Hgene and Tgene of each FO. Columns = "FusionGID", "FusionGene", "Hgene", "Tgene", "URL", "HGID", "TGID", "HGUniProtAcc", "TGUniProtAcc", "HGUniProtAcc_Source", "TGUniProtAcc_Source", "HG_pLDDT", "HG_AA_pLDDTs", "HG_Seq", "TG_pLDDT", "TG_AA_pLDDTs", "TG_Seq".
            b. **`FusionPDB_level2-3_cleaned_structure_info.csv`**: includes full structural information for each FO. Columns = "FusionGID", "FusionGene", "Fusion_Seq", "Fusion_Length", "Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand", "Level", "Fusion_Structure_Link", "Fusion_Structure_Type", "Fusion_pLDDT", "Fusion_AA_pLDDTs", "Fusion_Seq_Source"
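To make the pLDDT extraction step concrete: AlphaFold2 `.pdb` files store the per-residue pLDDT in the B-factor column, so the sequence and scores can be read from the C-alpha `ATOM` records. The sketch below uses only the standard library and fixed PDB column positions; it assumes a single-chain, single-model file and is not the code used in `process_fusion_structures.py` (the function name is hypothetical).

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_plddt_from_pdb(pdb_text):
    """Return (sequence, per-residue pLDDTs, average pLDDT) from PDB text.

    Assumes one chain and one model, with pLDDT in the B-factor field
    (columns 61-66 of each ATOM record), as in AlphaFold2 output.
    """
    sequence, plddts = [], []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Columns 13-16 hold the atom name; keep one atom (CA) per residue.
        if line[12:16].strip() != "CA":
            continue
        # Columns 18-20 hold the residue name; unknown residues become "X".
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
        plddts.append(float(line[60:66]))
    average = sum(plddts) / len(plddts) if plddts else float("nan")
    return "".join(sequence), plddts, average
```

The extracted sequence is what would be compared against the FusionPDB sequence in the cleaning step, where mismatching rows are dropped.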
### Training
The model is defined in `model.py` and `utils.py`. Training configs can be provided in `config.py`:
```
# Which models to benchmark
BENCHMARK_FUSONPLM = True
# FUSONPLM_CKPTS. If you've trained your own model, this is a dictionary: key = run name, values = epochs
# If you want to use the trained FusOn-pLM, instead set FUSONPLM_CKPTS="FusOn-pLM"
FUSONPLM_CKPTS= "FusOn-pLM"
BENCHMARK_ESM = True
# GPU configs
CUDA_VISIBLE_DEVICES="0"
# Overwriting configs
PERMISSION_TO_OVERWRITE_EMBEDDINGS = False # if False, script will halt if it believes these embeddings have already been made.
PERMISSION_TO_OVERWRITE_MODELS = False # if False, script will halt if it believes these models have already been trained.
```
`train.py` trains the models using embeddings indicated in `config.py`. It also performs a hyperparameter screen.
- All **results** are stored in `caid/results/<timestamp>`, where `timestamp` is a unique string encoding the date and time when you started training.
- All **raw outputs from models** are stored in `caid/trained_models/<embedding_path>`, where `embedding_path` represents the embeddings used to build the disorder predictor.
- All **embeddings** made for training will be stored in a new folder called `caid/embeddings/` with subfolders for each model. This allows you to use the same model multiple times without regenerating embeddings.
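The two `PERMISSION_TO_OVERWRITE_*` flags act as guards against clobbering earlier work. One way such a guard could look is sketched below; this is an illustration under the assumption that "already made" is detected by the output path existing, and `check_overwrite` is a hypothetical helper rather than a function from `train.py`.

```python
import os


def check_overwrite(path, permission_to_overwrite):
    """Return `path` if it is safe to write there, otherwise halt.

    Assumption: an output counts as "already made" when the path exists.
    """
    if os.path.exists(path) and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set the matching PERMISSION_TO_OVERWRITE "
            "flag to True in config.py to regenerate it."
        )
    return path
```

With both flags left at `False`, a rerun fails fast instead of silently replacing embeddings or trained models.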
Below is the FusOn-pLM-Diso raw outputs folder, `trained_models/fuson_plm/best/` (ESM-2-650M-Diso has a folder in the same format, and future trained models will as well):
```
benchmarking/
└── caid/
    └── trained_models/
        ├── esm2_t33_650M_UR50D/best/
        └── fuson_plm/best/
            ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
            ├── caid_hyperparam_screen_fusion_benchmark_probs.csv
            ├── caid_hyperparam_screen_test_metrics.csv
            ├── caid_hyperparam_screen_test_probs.csv
            ├── caid_train_losses.csv
            └── params.txt
```
- **`caid_hyperparam_screen_fusion_benchmark_metrics.csv`**: performance metrics (Accuracy, Precision, Recall, F1 Score, AUROC) for the top model on the fusion benchmark set (`splits/fusion_bench_df.csv`)
- **`caid_hyperparam_screen_fusion_benchmark_probs.csv`**: for the fusion benchmark, raw probabilities of class 1 (disorder), threshold used to assign 0/1 based on maximized F1 score, prediction labels based on probabilities and threshold
- **`caid_hyperparam_screen_test_metrics.csv`**: same as `caid_hyperparam_screen_fusion_benchmark_metrics.csv`, but for CAID2 Disorder-NOX (`splits/test_df.csv`)
- **`caid_hyperparam_screen_test_probs.csv`**: same as `caid_hyperparam_screen_fusion_benchmark_probs.csv`, but for CAID2 Disorder-NOX
- **`caid_train_losses.csv`**: train losses over the 2 training epochs for top-performing model
- **`params.txt`**: hyperparameters of top performing model
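The `*_probs.csv` files record a threshold chosen to maximize F1. A minimal sketch of that selection is shown here, assuming the candidate thresholds are the observed probabilities themselves and that a residue is called disordered when its probability is at or above the threshold; the training code may search thresholds differently, and both function names are hypothetical.

```python
def f1_score(labels, preds):
    """F1 for binary labels/predictions; 0.0 when there are no positives at all."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(labels, probs):
    """Return (threshold, F1) for the threshold with the highest F1.

    Candidates are the distinct probabilities; ties keep the lowest threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(probs)):
        preds = [1 if p >= threshold else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1
```

For example, `best_f1_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` selects the threshold 0.35, where both disordered residues are recovered at the cost of one false positive.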
Results from the FusOn-pLM manuscript are found in `results/final`. A few extra data files and plots are added by `analyze_fusion_preds.py`.
```
benchmarking/
└── caid/
    └── results/final/
        ├── best_caid_model_results.csv
        ├── caid_hyperparam_screen_test_metrics.csv
        ├── caid_hyperparam_screen_fusion_benchmark_metrics.csv
        ├── caid_hyperparam_screen_train_losses.csv
        ├── fusion_disorder_boxplots.png
        ├── fusion_pred_disorder_r2.png
        ├── fusion_disorder_boxplots_source_data.csv
        ├── fusion_pred_disorder_r2_source_data.csv
        ├── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
        ├── CAID_fpr_tpr_source_data.csv
        └── CAID_prediction_source_data.csv
```
- **`best_caid_model_results.csv`**: Summary file of hyperparameters, test set statistics, and fusion benchmark statistics for the best model of each type screened (ESM-2-650M, FusOn-pLM)
- **`caid_hyperparam_screen_fusion_benchmark_metrics.csv`**: Fusion benchmark set statistics for full hyperparameter screen
- **`caid_hyperparam_screen_test_metrics.csv`**: Test set statistics for full hyperparameter screen
- **`caid_hyperparam_screen_train_losses.csv`**: Train losses for full hyperparameter screen
- **`fusion_disorder_boxplots.png`**: Fig. 4E, left (data directly used to produce the plot at `fusion_disorder_boxplots_source_data.csv`)
- **`fusion_pred_disorder_r2.png`**: Fig. 4E, right (data directly used to produce the plot at `fusion_pred_disorder_r2_source_data.csv`)
- **`CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png`**: Fig. 4D (probabilities used at `CAID_prediction_source_data.csv`, FPR/TPR relationships directly used to make the plot at `CAID_fpr_tpr_source_data.csv`)
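For readers who want to see how per-residue probabilities relate to the FPR/TPR pairs behind an AUROC curve, here is a small standard-library sketch. It is a generic ROC construction with trapezoidal integration, offered as an illustration and not as the code that produced `CAID_fpr_tpr_source_data.csv`; the function names are hypothetical.

```python
def roc_curve_points(labels, probs):
    """Return (fpr, tpr) lists obtained by sweeping the threshold downward."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    ranked = sorted(zip(probs, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (prob, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all residues tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != prob:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auroc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

The same two functions can be applied to any CAID2 competitor's predictions once they are aligned residue by residue with the reference labels.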
To run the training script, use
```
nohup python train.py > train.out 2> train.err &
```
### Plotting
The `plot.py` script generates many figures from the paper, alongside the formatted data directly used for plotting.
```
benchmarking/
└── caid/
    ├── results/final/
    │   └── CAID2_FusOn-pLM-Diso_with_ESM_AUROC_curve.png
    └── processed_data/
        └── figures/
            ├── fusion_disorder/
            │   ├── plddt_sequence_EML4-ALK.png
            │   ├── plddt_sequence_EML4::ALK_source_data.csv
            │   ├── plddt_sequence_EWSR1-FLI1.png
            │   ├── plddt_sequence_EWSR1::FLI1_source_data.csv
            │   ├── plddt_sequence_PAX3-FOXO1.png
            │   ├── plddt_sequence_PAX3::FOXO1_source_data.csv
            │   ├── plddt_sequence_SS18-SSX1.png
            │   └── plddt_sequence_SS18::SSX1_source_data.csv
            └── histograms/
                ├── disorder_nox_histogram.png
                ├── disorder_nox_histogram_source_data.csv
                ├── fusions_histogram.png
                ├── fusions_histogram_source_data.csv
                ├── heads_histogram.png
                ├── heads_histogram_source_data.csv
                ├── tails_histogram.png
                └── tails_histogram_source_data.csv
```
- Plots in `fusion_disorder` are from Fig. 1C
- Plots in `histograms` are from Fig. 1D and Fig. S1
To regenerate these plots and source data, run:
```
python plot.py
```
### Colored structure images
`color_disorder_residues.ipynb` is used to plot fusion structures with pLDDT or disorder prediction color overlays. By running certain (or all) of its cells, you will recreate images from Fig. 1C and 4F, as well as the following file:
```
benchmarking/
└── caid/
    └── disorder_coloring_data/
        └── normalized_disorder_propensities_source_data.csv
```
- **`normalized_disorder_propensities_source_data.csv`**: the normalized disorder propensities that were visualized on fusion structures in Fig. 4F