# Embedding exploration
This folder contains all the data and code needed to run the embedding exploration analysis (Fig. S3).
### Data download
To help select transcription factor (TF)- and kinase-containing fusions for investigation (Fig. S3a), Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8) was downloaded as a reference list of transcription factors and kinases.
```
benchmarking/
└── embedding_exploration/
    └── data/
        ├── salokas_2020_tableS3.csv
        ├── tf_and_kinase_fusions.csv
        └── top_genes.csv
```
- **`data/salokas_2020_tableS3.csv`**: Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8).
- **`data/tf_and_kinase_fusions.csv`**: the set of TF::TF and Kinase::Kinase fusion oncoproteins from the FusOn-DB database, curated in `plot.py`.
- **`data/top_genes.csv`**: the fusion oncoproteins (and their head and tail components) visualized in Fig. S3b. Sequences for the head and tail components were pulled from the best-aligned sequences in `fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.
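The TF::TF and Kinase::Kinase curation itself lives in `plot.py`. The snippet below is only a minimal sketch of that kind of filter, not the script's actual code. It assumes fusion names follow a `HEAD::TAIL` pattern and that the reference table has already been reduced to two sets of gene symbols. The function name `classify_fusion` and the placeholder gene names are hypothetical.

```python
from typing import Optional, Set


def classify_fusion(name: str, tfs: Set[str], kinases: Set[str]) -> Optional[str]:
    """Return "TF::TF" or "Kinase::Kinase" if both partners share that class, else None.

    Assumes `name` looks like "HEAD::TAIL" (an assumption, not confirmed by the source).
    """
    parts = name.split("::")
    if len(parts) != 2:
        return None  # not a simple two-partner fusion name
    head, tail = parts
    if head in tfs and tail in tfs:
        return "TF::TF"
    if head in kinases and tail in kinases:
        return "Kinase::Kinase"
    return None


# Placeholder gene symbols for illustration only; the real reference lists
# would come from data/salokas_2020_tableS3.csv.
example_tfs = {"TF_A", "TF_B"}
example_kinases = {"KIN_A", "KIN_B"}
```

Mixed fusions (for example one TF partner and one kinase partner) return `None` in this sketch, so they would be left out of a TF::TF / Kinase::Kinase table.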
### Plotting
Run `plot.py` to regenerate the plots in Fig. S3. The run is controlled by the following configuration options:
```
# Checkpoint to embed with. Either "FusOn-pLM" to use the official model, or a
# dictionary (key = run name, values = epochs) if you've trained your own model.
FUSON_PLM_CKPT = "FusOn-pLM"
# Type of dimensionality reduction
PLOT_UMAP = True
PLOT_TSNE = False
# Overwriting configs
PERMISSION_TO_OVERWRITE = False  # if False, the script halts if it believes these embeddings have already been made
```
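To illustrate what the `PERMISSION_TO_OVERWRITE` flag protects against, here is a minimal sketch of an overwrite guard. It is not the check that `plot.py` performs; the helper name `check_overwrite` and the choice to raise `FileExistsError` are assumptions made for this example.

```python
from pathlib import Path


def check_overwrite(output_path: Path, permission_to_overwrite: bool) -> None:
    """Raise if `output_path` already exists and overwriting is not allowed.

    Hypothetical helper: the real script may detect existing embeddings differently.
    """
    if output_path.exists() and not permission_to_overwrite:
        raise FileExistsError(
            f"{output_path} already exists; "
            "set PERMISSION_TO_OVERWRITE = True to regenerate it."
        )
```

Calling the guard before any expensive embedding step means a run halts early instead of silently replacing earlier outputs.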
To run, use:
```
nohup python plot.py > plot.out 2> plot.err &
```
- All **results** are stored in `embedding_exploration/results/<timestamp>`, where `timestamp` is a unique string encoding the date and time when you started the run.
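A timestamped results folder of this kind can be created as sketched below. The exact timestamp format used by `plot.py` is not documented here, so the `strftime` pattern and the helper name `make_results_dir` are assumptions.

```python
from datetime import datetime
from pathlib import Path


def make_results_dir(base: Path, now: datetime) -> Path:
    """Create and return `base/results/<timestamp>` for a run started at `now`."""
    # Assumed format; the script's real timestamp format may differ.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    out_dir = base / "results" / stamp
    # exist_ok=False so two runs can never share (and overwrite) one folder.
    out_dir.mkdir(parents=True, exist_ok=False)
    return out_dir
```

For example, `make_results_dir(Path("embedding_exploration"), datetime.now())` would create a fresh folder for the current run.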
Below are the FusOn-pLM paper results in `results/final/umap_plots/fuson_plm/best/`:
```
benchmarking/
└── embedding_exploration/
    └── results/final/umap_plots/fuson_plm/best/
        ├── favorites/
        │   ├── umap_favorites_source_data.csv
        │   └── umap_favorites_visualization.png
        └── tf_and_kinase/
            ├── umap_tf_and_kinase_fusions_source_data.csv
            └── umap_tf_and_kinase_fusions_visualization.png
```
- **`favorites/umap_favorites_visualization.png`**: Fig. S3b. The plotted data are stored in `favorites/umap_favorites_source_data.csv`.
- **`tf_and_kinase/umap_tf_and_kinase_fusions_visualization.png`**: Fig. S3a. The plotted data are stored in `tf_and_kinase/umap_tf_and_kinase_fusions_source_data.csv`.
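Because the column names of the source data files are not listed here, a reasonable first step before re-plotting is to inspect a file's header and first few rows. The helper below is a minimal sketch using only the standard library; the name `peek_csv` is hypothetical and it makes no assumption about which columns exist.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def peek_csv(path: Path, n_rows: int = 5) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the column names and the first `n_rows` data rows of a CSV file."""
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        # zip stops once range(n_rows) is exhausted, so no extra rows are read.
        rows = [row for _, row in zip(range(n_rows), reader)]
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Run from `embedding_exploration/`, a call such as `peek_csv(Path("results/final/umap_plots/fuson_plm/best/favorites/umap_favorites_source_data.csv"))` shows which columns are available for plotting.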