# Embedding exploration
This folder contains all the data and code needed to run embedding exploration (Fig. S3).
### Data download
To help select transcription factor (TF)- and kinase-containing fusions for investigation (Fig. S3a), Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8) was downloaded as a reference list of transcription factors and kinases.
```
benchmarking/
└── embedding_exploration/
    └── data/
        ├── salokas_2020_tableS3.csv
        ├── tf_and_kinase_fusions.csv
        └── top_genes.csv
```
- **`data/salokas_2020_tableS3.csv`**: Supplementary Table 3 from [Salokas et al. 2020](https://doi.org/10.1038/s41598-020-71040-8)
- **`data/tf_and_kinase_fusions.csv`**: set of TF::TF and Kinase::Kinase fusion oncoproteins from the FusOn-DB database, curated in `plot.py`
- **`data/top_genes.csv`**: fusion oncoproteins (and their head and tail components) visualized in Fig. S3b. Sequences for head and tail components were pulled from the best-aligned sequences in `fuson_plm/data/blast/blast_outputs/best_htg_alignments_swissprot_seqs.pkl`
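As a rough illustration of how TF::TF and Kinase::Kinase fusions could be selected against a reference list, here is a minimal sketch. It is not the logic in `plot.py`: the function name `classify_fusion`, the assumption that fusion names follow a `HEAD::TAIL` pattern, and the placeholder gene names are all hypothetical. In practice the two gene sets would be built from `data/salokas_2020_tableS3.csv`.

```python
def classify_fusion(fusion_name, tf_genes, kinase_genes):
    """Label a 'HEAD::TAIL' fusion as 'TF::TF' or 'Kinase::Kinase', else None.

    Hypothetical helper: assumes fusion names join head and tail genes with '::'.
    """
    head, sep, tail = fusion_name.partition("::")
    if not sep:
        # Name does not follow the assumed HEAD::TAIL pattern
        return None
    if head in tf_genes and tail in tf_genes:
        return "TF::TF"
    if head in kinase_genes and tail in kinase_genes:
        return "Kinase::Kinase"
    return None


# Placeholder gene sets for illustration only (not real annotations)
tf_genes = {"GENE_A", "GENE_B"}
kinase_genes = {"GENE_K1", "GENE_K2"}

fusions = ["GENE_A::GENE_B", "GENE_K1::GENE_K2", "GENE_A::GENE_K1", "malformed"]

# Keep only fusions whose head and tail fall in the same category
labels = {f: classify_fusion(f, tf_genes, kinase_genes) for f in fusions}
selected = {f: label for f, label in labels.items() if label is not None}
print(selected)
```

Mixed fusions (for example a TF head with a kinase tail) and names that do not match the assumed pattern are dropped.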
### Plotting
Run `plot.py` to regenerate the plots in Fig. S3. The following configuration options control the run:
```
# Checkpoint(s) to embed with. Either a dictionary (key = run name, values = epochs;
# use this option if you've trained your own model) or "FusOn-pLM" to use the official model
FUSON_PLM_CKPT = "FusOn-pLM"
# Type of dimensionality reduction
PLOT_UMAP = True
PLOT_TSNE = False
# Overwriting configs
PERMISSION_TO_OVERWRITE = False  # If False, the script will halt if it believes these embeddings have already been made.
```
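The `PERMISSION_TO_OVERWRITE` flag describes a common guard pattern: refuse to continue when an output already exists unless the user opts in. The sketch below shows one way such a check might look. The helper `ensure_can_write` and the file name `embeddings.pkl` are hypothetical, and the actual check in `plot.py` may differ.

```python
import tempfile
from pathlib import Path

PERMISSION_TO_OVERWRITE = False


def ensure_can_write(path, permission_to_overwrite):
    """Return the path if it may be written; raise if it exists and overwriting is off.

    Hypothetical helper, not taken from plot.py.
    """
    path = Path(path)
    if path.exists() and not permission_to_overwrite:
        raise FileExistsError(
            f"{path} already exists; set PERMISSION_TO_OVERWRITE = True to replace it."
        )
    return path


# Demonstrate the guard against a throwaway file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "embeddings.pkl"  # illustrative file name
    existing.write_bytes(b"")

    try:
        ensure_can_write(existing, PERMISSION_TO_OVERWRITE)
        halted = False
    except FileExistsError:
        halted = True

    # With permission granted, the same path is accepted
    allowed = ensure_can_write(existing, True) == existing

print(halted, allowed)
```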
To run, use:
```
nohup python plot.py > plot.out 2> plot.err &
```
- All **results** are stored in `embedding_exploration/results/<timestamp>`, where `timestamp` is a unique string encoding the date and time when you started the run.
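For readers who want to mirror this layout in their own scripts, a timestamped results folder can be derived from the run's start time as sketched below. The exact timestamp format used by `plot.py` is not stated here, so the `%Y-%m-%d_%H-%M-%S` pattern and the helper name `make_results_dir` are assumptions for illustration.

```python
from datetime import datetime
from pathlib import Path


def make_results_dir(start_time=None, root="embedding_exploration/results"):
    """Build a results path named after the run's start time.

    Hypothetical helper; the timestamp format is an assumption.
    """
    start_time = start_time or datetime.now()
    timestamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(root) / timestamp


# Fixed start time so the example is reproducible
example_dir = make_results_dir(datetime(2024, 1, 2, 3, 4, 5))
print(example_dir.as_posix())
```

Calling `example_dir.mkdir(parents=True, exist_ok=False)` would then create the folder and fail loudly if a run with the same timestamp already exists.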
Below are the FusOn-pLM paper results in `results/final/umap_plots/fuson_plm/best/`:
```
benchmarking/
└── embedding_exploration/
    └── results/final/umap_plots/fuson_plm/best/
        ├── favorites/
        │   ├── umap_favorites_source_data.csv
        │   └── umap_favorites_visualization.png
        └── tf_and_kinase/
            ├── umap_tf_and_kinase_fusions_source_data.csv
            └── umap_tf_and_kinase_fusions_visualization.png
```
- **`favorites/umap_favorites_visualization.png`**: Fig. S3b. The plotted data are stored in `favorites/umap_favorites_source_data.csv`.
- **`tf_and_kinase/umap_tf_and_kinase_fusions_visualization.png`**: Fig. S3a. The plotted data are stored in `tf_and_kinase/umap_tf_and_kinase_fusions_source_data.csv`.