Training Data Curation and Processing
The data folder and its subfolders hold all raw data and processed data used to assemble FusOn-DB, as well as all processing scripts. Additional benchmarking datasets can be found in the benchmarking folder.
From raw data to train/val/test splits and head/tail data
This section outlines the pipeline for converting the raw FusionPDB and FOdb datasets into the train/val/test splits used in FusOn-pLM. The process consists of data cleaning, clustering, and splitting. During cleaning, we also extracted data about the heads and tails of each fusion oncoprotein.
data/
├── clustering/
│   ├── input.fasta
│   ├── mmseqs_full_results.csv
├── head_tail_data/
│   ├── uniprot_idmap_inputs/
├── raw_data/
│   ├── FOdb_all.csv
│   ├── FOdb_puncta.csv
│   ├── FOdb_SD5.csv
│   ├── FusionPDB_cleaned.csv
│   ├── FusionPDB.txt
│   ├── gene_to_ensembl_dict.pkl
├── splits/
│   ├── combined_plot.png
│   ├── train_df.csv
│   ├── train_cluster_split.csv
│   ├── val_df.csv
│   ├── val_cluster_split.csv
│   ├── test_df.csv
│   ├── test_cluster_split.csv
├── clean.py
├── cluster.py
├── config.py
├── split.py
├── split_vis.py
├── data_cleaning_log.txt
├── clustering_log.txt
├── splitting_log.txt
├── fuson_db.csv
- clean.py: script for cleaning the datasets in raw_data. Print statements in this code produce data_cleaning_log.txt.
- cluster.py: script for clustering the processed data in fuson_db.csv. Print statements in this code produce clustering_log.txt.
- config.py: configs for the cleaning, clustering, and splitting scripts.
- split.py: script for splitting the data after clustering. Print statements in this code produce splitting_log.txt.
- split_vis.py: script with code for the plots in splits/combined_plot.png, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes). Note that many of the methods are defined in fuson_plm/utils/visualizing.py.
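Two of the per-sequence statistics named above, Shannon entropy and amino acid frequencies, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the choice of log base 2 (entropy in bits) is an assumption, and the actual plotting code in fuson_plm/utils/visualizing.py may compute these quantities differently.

```python
import math
from collections import Counter


def aa_frequencies(sequence: str) -> dict:
    """Return the relative frequency of each residue in a sequence."""
    total = len(sequence)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(sequence).items()}


def shannon_entropy(sequence: str) -> float:
    """Return the Shannon entropy of the residue distribution.

    Assumption: log base 2, so the result is in bits.
    """
    freqs = aa_frequencies(sequence)
    return -sum(p * math.log2(p) for p in freqs.values())


if __name__ == "__main__":
    # Toy sequences, not taken from FusOn-DB.
    for seq in ("AAAA", "AACC", "ACDE"):
        print(seq, shannon_entropy(seq))
```

A sequence made of a single repeated residue has entropy 0, and entropy grows as residues become more evenly distributed.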
Usage
To repeat our cleaning, clustering, and splitting process, proceed as follows.
- Install MMSeqs2 at /*/FusOn-pLM/fuson_plm/mmseqs2 according to these instructions: https://github.com/soedinglab/MMseqs2. Make sure that in config.py, CLUSTER.PATH_TO_MMSEQS points to your mmseqs installation.
- Run the cleaning script:
python clean.py
This script will create the following files:
  - fuson_db.csv: FusOn-DB, our full database of 44,414 fusion oncoproteins.
  - raw_data/FusionPDB_cleaned.csv: a processed version of the FusionPDB database with the following columns: aa_seq, n_fusiongenes, fusiongenes, cancers, primary_sources, secondary_source.
  - head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt and head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt: all unique Ensembl IDs and gene symbols for all unique head/tail proteins corresponding to any fusion oncoproteins in FusOn-DB. These were submitted to the UniProt ID-mapping tool to create head_tail_data/ensembl_ht_idmap.txt and head_tail_data/genename_ht_idmap.txt, respectively.
  - head_tail_data/uniprot_idmap_inputs/gene_to_ensembl_dict.pkl: a dictionary mapping each unique gene symbol to a comma-separated list of its associated Ensembl IDs, according to FusionPDB.
  - head_tail_data/uniprot_idmap_inputs/htgenes_uniprotids.csv: a file with each unique gene symbol (Gene), a comma-separated list of all associated UniProt IDs (UniProtID), and a concatenated list of 1s and 0s representing whether each ID in the UniProtID column is reviewed or not (Reviewed).
    - For example, a Reviewed value of "100" means the first ID in the UniProtID column of the same row is reviewed (1) and the second and third are not (0).
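The Reviewed encoding described above can be decoded by pairing each comma-separated UniProt ID with the flag at the same position. The sketch below is a minimal illustration; the function name and the accessions in the example are placeholders, not values from htgenes_uniprotids.csv.

```python
def pair_ids_with_review_status(uniprot_ids: str, reviewed: str) -> list:
    """Pair each UniProt ID with a boolean 'reviewed' flag.

    uniprot_ids: comma-separated IDs, as in the UniProtID column.
    reviewed:    string of 1s and 0s, as in the Reviewed column.
    """
    ids = [uid.strip() for uid in uniprot_ids.split(",") if uid.strip()]
    if len(ids) != len(reviewed):
        raise ValueError(
            f"{len(ids)} IDs but {len(reviewed)} review flags; "
            "was the Reviewed column parsed as a number?"
        )
    return [(uid, flag == "1") for uid, flag in zip(ids, reviewed)]


if __name__ == "__main__":
    # Placeholder accessions for illustration only.
    print(pair_ids_with_review_status("ID_A,ID_B,ID_C", "100"))
```

If the CSV is loaded with pandas, reading the Reviewed column as a string (for example with `dtype={"Reviewed": str}` in `read_csv`) keeps values with leading zeros, such as "001", intact.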
- Run the clustering script:
python cluster.py
The command this script passes to MMSeqs2 is:
mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.8 --cov-mode 0
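As a rough sketch, the command above could be assembled in Python as an argument list. The function name and its parameters are hypothetical and are not taken from cluster.py; the default values simply mirror the command shown. In MMSeqs2, `--min-seq-id 0.3` sets a 30% minimum sequence identity, `-c 0.8` an 80% coverage threshold, and `--cov-mode 0` applies that coverage to both query and target.

```python
import shlex


def build_mmseqs_command(
    path_to_mmseqs: str = "mmseqs",  # assumed to play the role of CLUSTER.PATH_TO_MMSEQS
    input_fasta: str = "clustering/input.fasta",
    output_prefix: str = "clustering/raw_output/mmseqs",
    tmp_dir: str = "clustering/raw_output",
    min_seq_id: float = 0.3,
    coverage: float = 0.8,
    cov_mode: int = 0,
) -> list:
    """Return the easy-cluster invocation as a list of arguments."""
    return [
        path_to_mmseqs, "easy-cluster",
        input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
    ]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(build_mmseqs_command()))
```

Passing the list to `subprocess.run(command, check=True)` would execute it without involving a shell.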
This script will cluster all sequences of length 2000 or shorter (see config.py) and create the following files:
  - clustering/input.fasta: the input file used by MMSeqs2 to cluster the fusion oncoprotein sequences. Headers are our assigned sequence IDs (found in the seq_id column of fuson_db.csv).
  - clustering/mmseqs_full_results.csv: clustering results. Columns:
    - representative seq_id: the seq_id of the sequence representing this cluster
    - member seq_id: the seq_id of a member of the cluster
    - representative seq: the amino acid sequence of the cluster representative (representative seq_id)
    - member seq: the amino acid sequence of the cluster member
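The two files described above can be illustrated with a short standard-library sketch: one helper writes FASTA records for sequences within the length limit, and another groups cluster members by representative using the column names listed for mmseqs_full_results.csv. The helper names, the `MAX_LENGTH` constant, and the toy records are assumptions for illustration and do not reproduce cluster.py.

```python
import csv
import io
from collections import defaultdict

MAX_LENGTH = 2000  # assumed to mirror the length limit set in config.py


def to_fasta(records, max_length: int = MAX_LENGTH) -> str:
    """Format (seq_id, sequence) pairs as FASTA, skipping over-long sequences."""
    return "".join(
        f">{seq_id}\n{seq}\n" for seq_id, seq in records if len(seq) <= max_length
    )


def clusters_from_results(handle) -> dict:
    """Map each representative seq_id to the list of its member seq_ids."""
    clusters = defaultdict(list)
    for row in csv.DictReader(handle):
        clusters[row["representative seq_id"]].append(row["member seq_id"])
    return dict(clusters)


if __name__ == "__main__":
    # Toy data with made-up IDs; real IDs come from the seq_id column of fuson_db.csv.
    print(to_fasta([("seq1", "MKT"), ("seq2", "A" * 2001)]), end="")

    toy_csv = io.StringIO(
        "representative seq_id,member seq_id\n"
        "seq1,seq1\n"
        "seq1,seq2\n"
        "seq3,seq3\n"
    )
    for rep, members in clusters_from_results(toy_csv).items():
        print(rep, len(members))
```

Grouping by representative also gives the cluster sizes directly, since each cluster's size is the length of its member list.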
- Run the splitting script:
python split.py
This script will create the following files:
  - splits/train_cluster_split.csv, splits/val_cluster_split.csv, splits/test_cluster_split.csv: the subsets of clustering/mmseqs_full_results.csv that have been partitioned into the train, validation, and test sets, respectively.
  - splits/train_df.csv, splits/val_df.csv, splits/test_df.csv: the train, validation, and test splits used to train FusOn-pLM. Columns: sequence, member length.
  - the split_vis folder, which contains all visualizations in Fig. S4 and the data that was directly plotted in these visualizations (*_source_data.csv files). Note that the individual subplots have slightly different dimensions than they do in the combined Fig. S4.
    - splits/split_vis/combined_plot.png: plot displaying the composition of the train, validation, and test splits (Fig. S4).
    - splits/split_vis/length_distributions.png: plot displaying the length distributions of the train, validation, and test splits (Fig. S4A).
    - splits/split_vis/shannon_entropy_plot.png: plot displaying the Shannon entropy distributions of the train, validation, and test splits (Fig. S4B).
    - splits/split_vis/scatterplot.png: plot displaying the cluster size distributions of the train, validation, and test splits (Fig. S4C).
    - splits/split_vis/aa_comp.png: plot displaying the amino acid composition of the train, validation, and test splits (Fig. S4D).
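Because the split files are subsets of the clustering results, each cluster ends up in exactly one of the train, validation, and test sets. One simple way to get that property is sketched below. It is a hypothetical greedy scheme, not the algorithm in split.py: the function name, the 10% validation and test fractions, and the random seed are all assumptions made for illustration.

```python
import random


def split_clusters(clusters: dict, val_frac: float = 0.1,
                   test_frac: float = 0.1, seed: int = 0) -> dict:
    """Assign whole clusters to train/val/test so no cluster spans two splits.

    clusters: maps a representative seq_id to the list of its member seq_ids.
    Fractions are targets over the total number of sequences (assumed values).
    """
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)  # reproducible shuffle of cluster order
    total = sum(len(members) for members in clusters.values())

    splits = {"train": [], "val": [], "test": []}
    for rep in reps:
        members = clusters[rep]
        # Fill test first, then val, then send everything else to train.
        if len(splits["test"]) < test_frac * total:
            splits["test"].extend(members)
        elif len(splits["val"]) < val_frac * total:
            splits["val"].extend(members)
        else:
            splits["train"].extend(members)
    return splits


if __name__ == "__main__":
    # Toy clusters with made-up IDs: ten clusters of two members each.
    toy = {f"rep{i}": [f"rep{i}", f"member{i}"] for i in range(10)}
    for name, ids in split_clusters(toy).items():
        print(name, len(ids))
```

Splitting at the cluster level rather than the sequence level keeps similar sequences from landing in both the training and evaluation sets.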
BLAST
We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the blast folder for more details.