## Training Data Curation and Processing

The `data` folder and its subfolders hold all raw data and processed data used to assemble FusOn-DB, as well as all processing scripts. Additional benchmarking datasets can be found in the `benchmarking` folder.  

### From raw data to train/val/test splits and head/tail data
This section outlines the pipeline for converting the raw FusionPDB and FOdb datasets into the train/val/test splits used in FusOn-pLM. The process consists of data cleaning, clustering, and splitting. During cleaning, we also extracted data about the heads and tails of each fusion oncoprotein.

```
data/
├── clustering/
│   ├── input.fasta
│   └── mmseqs_full_results.csv
├── head_tail_data/
│   └── uniprot_idmap_inputs/
├── raw_data/
│   ├── FOdb_all.csv
│   ├── FOdb_puncta.csv
│   ├── FOdb_SD5.csv
│   ├── FusionPDB_cleaned.csv
│   ├── FusionPDB.txt
│   └── gene_to_ensembl_dict.pkl
├── splits/
│   ├── combined_plot.png
│   ├── train_df.csv
│   ├── train_cluster_split.csv
│   ├── val_df.csv
│   ├── val_cluster_split.csv
│   ├── test_df.csv
│   └── test_cluster_split.csv
├── clean.py
├── cluster.py
├── config.py
├── split.py
├── split_vis.py
├── data_cleaning_log.txt
├── clustering_log.txt
├── splitting_log.txt
└── fuson_db.csv
```
- **`clean.py`**: script for cleaning the datasets in `raw_data`. Print statements in this script produce `data_cleaning_log.txt`.
- **`cluster.py`**: script for clustering the processed data in `fuson_db.csv`. Print statements in this script produce `clustering_log.txt`.
- **`config.py`**: configs for the cleaning, clustering, and splitting scripts.
- **`split.py`**: script for splitting the data after clustering. Print statements in this script produce `splitting_log.txt`.
- **`split_vis.py`**: script with code for the plots in `splits/combined_plot.png`, which describe the content of the train, validation, and test splits (length distribution, Shannon entropy, amino acid frequencies, and cluster sizes). Note that many of the methods are defined in `fuson_plm/utils/visualizing.py`.
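
As a rough illustration of two of the per-split statistics named above, the sketch below computes the Shannon entropy of a single sequence and pooled amino acid frequencies. The function names are hypothetical and are not taken from `fuson_plm/utils/visualizing.py`; the entropy here is assumed to be over the residue distribution of each sequence, in bits.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (in bits) of the residue distribution of one sequence.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    total = len(sequence)
    probs = [n / total for n in Counter(sequence).values()]
    # H = sum(-p * log2(p)) over the residues that actually occur
    return sum(-p * math.log2(p) for p in probs)


def aa_frequencies(sequences: list[str]) -> dict[str, float]:
    """Pooled amino acid frequencies across a list of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in sorted(counts.items())}
```

A sequence made of a single repeated residue has entropy 0, and a sequence with two equally frequent residues has entropy 1 bit.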

#### Usage
To repeat our cleaning, clustering, and splitting process, proceed as follows.
1. Install MMseqs2 at `/*/FusOn-pLM/fuson_plm/mmseqs2` according to these instructions: https://github.com/soedinglab/MMseqs2. Make sure that `CLUSTER.PATH_TO_MMSEQS` in `config.py` points to your MMseqs2 installation.
2. Run the cleaning script:
```bash
python clean.py
```

This script will create the following files:
- **`fuson_db.csv`**: FusOn-DB, our full database of 44,414 fusion oncoproteins.
- **`raw_data/FusionPDB_cleaned.csv`**: a processed version of the FusionPDB database with the following columns: `aa_seq`, `n_fusiongenes`, `fusiongenes`, `cancers`, `primary_sources`, `secondary_source`.
- **`head_tail_data/uniprot_idmap_inputs/head_tail_ens.txt`** and **`head_tail_data/uniprot_idmap_inputs/head_tail_genes.txt`**: all unique Ensembl IDs and gene symbols for all unique head/tail proteins corresponding to any fusion oncoprotein in FusOn-DB. These were submitted to the UniProt ID-mapping tool to create **`head_tail_data/ensembl_ht_idmap.txt`** and **`head_tail_data/genename_ht_idmap.txt`**, respectively.
- **`head_tail_data/uniprot_idmap_inputs/gene_to_ensembl_dict.pkl`**: a dictionary mapping each unique gene symbol to a comma-separated list of its associated Ensembl IDs, according to FusionPDB.
- **`head_tail_data/uniprot_idmap_inputs/htgenes_uniprotids.csv`**: a file with each unique gene symbol (`Gene`), a comma-separated list of all associated UniProt IDs (`UniProtID`), and a concatenated string of 1s and 0s indicating whether each ID in the `UniProtID` column is reviewed (`Reviewed`).
    - For example, a `Reviewed` value of "100" means the first ID in the `UniProtID` column of the same row is reviewed (1) and the second and third are not (0).
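
The `UniProtID` / `Reviewed` encoding described above can be decoded by pairing each ID with the flag at the same position. The following is a minimal sketch, not code from `clean.py`; the function name and the placeholder IDs (`ID_A`, `ID_B`, `ID_C`) are hypothetical.

```python
def pair_reviewed_flags(uniprot_ids: str, reviewed: str) -> list[tuple[str, bool]]:
    """Pair each ID in a comma-separated `UniProtID` value with its `Reviewed` flag.

    Hypothetical helper: `reviewed` is the string of 1s and 0s from the same row.
    """
    ids = [uid.strip() for uid in uniprot_ids.split(",")]
    if set(reviewed) - {"0", "1"}:
        raise ValueError(f"Reviewed must contain only 0s and 1s, got {reviewed!r}")
    if len(ids) != len(reviewed):
        raise ValueError("number of IDs and number of Reviewed flags differ")
    return [(uid, flag == "1") for uid, flag in zip(ids, reviewed)]


# Placeholder IDs, for illustration only
pairs = pair_reviewed_flags("ID_A,ID_B,ID_C", "100")
reviewed_only = [uid for uid, is_reviewed in pairs if is_reviewed]
```

If the CSV is loaded with pandas, reading the `Reviewed` column as a string (for example with `dtype={"Reviewed": str}`) avoids it being parsed as an integer, which would drop leading zeros such as in "010".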

3. Run the clustering script:
```bash
python cluster.py
```

This script runs the following MMseqs2 command:
```bash
mmseqs easy-cluster clustering/input.fasta clustering/raw_output/mmseqs clustering/raw_output --min-seq-id 0.3 -c 0.8 --cov-mode 0
```
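
In the command above, `--min-seq-id 0.3` sets a minimum sequence identity of 30%, `-c 0.8` sets a minimum alignment coverage of 80%, and `--cov-mode 0` applies that coverage threshold to both the query and the target sequence. The three positional arguments are the input FASTA, the output prefix, and a temporary directory. As a sketch of how such a command could be assembled from Python (this is an assumption about usage, not the code in `cluster.py`; the function name and its defaults are hypothetical):

```python
import shlex


def build_easy_cluster_command(
    mmseqs_path: str = "mmseqs",
    input_fasta: str = "clustering/input.fasta",
    output_prefix: str = "clustering/raw_output/mmseqs",
    tmp_dir: str = "clustering/raw_output",
    min_seq_id: float = 0.3,
    coverage: float = 0.8,
    cov_mode: int = 0,
) -> list[str]:
    """Build the argument list for `mmseqs easy-cluster` (hypothetical helper)."""
    return [
        mmseqs_path, "easy-cluster", input_fasta, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
        "--cov-mode", str(cov_mode),      # 0: coverage of both query and target
    ]


cmd = build_easy_cluster_command()
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

The argument list can then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues when paths contain spaces.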

This script will cluster all sequences of length 2,000 or shorter (see `config.py`) and create the following files:
- **`clustering/input.fasta`**: the input file used by MMseqs2 to cluster the fusion oncoprotein sequences. Headers are our assigned sequence IDs (found in the `seq_id` column of `fuson_db.csv`).
- **`clustering/mmseqs_full_results.csv`**: clustering results. Columns:
    - `representative seq_id`: the seq_id of the sequence representing this cluster
    - `member seq_id`: the seq_id of a member of the cluster
    - `representative seq`: the amino acid sequence of the cluster representative (representative seq_id)
    - `member seq`: the amino acid sequence of the cluster member
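
Given the four columns listed above, cluster membership and cluster sizes can be recovered by grouping rows on `representative seq_id`. The sketch below uses only the standard library and a small in-memory toy table; the `seq1`-style IDs and short sequences are made up for illustration, and the toy assumes the representative also appears as a member of its own cluster.

```python
import csv
import io
from collections import defaultdict


def clusters_from_results(csv_file) -> dict[str, list[str]]:
    """Map each representative seq_id to the list of its member seq_ids."""
    clusters = defaultdict(list)
    for row in csv.DictReader(csv_file):
        clusters[row["representative seq_id"]].append(row["member seq_id"])
    return dict(clusters)


# Toy stand-in for clustering/mmseqs_full_results.csv (hypothetical values)
toy_results = io.StringIO(
    "representative seq_id,member seq_id,representative seq,member seq\n"
    "seq1,seq1,MKT,MKT\n"
    "seq1,seq2,MKT,MKV\n"
    "seq3,seq3,GGA,GGA\n"
)
clusters = clusters_from_results(toy_results)
cluster_sizes = {rep: len(members) for rep, members in clusters.items()}
```

To run this on the real file, pass an open file handle instead of the `io.StringIO` object, for example `open("clustering/mmseqs_full_results.csv", newline="")`.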

4. Run the splitting script:
```bash
python split.py
```

This script will create the following files:
- **`splits/train_cluster_split.csv`, `splits/val_cluster_split.csv`, `splits/test_cluster_split.csv`**: the subsets of `clustering/mmseqs_full_results.csv` that have been partitioned into the train, validation, and test sets, respectively.
- **`splits/train_df.csv`, `splits/val_df.csv`, `splits/test_df.csv`**: the train, validation, and test splits used to train FusOn-pLM. Columns: `sequence`, `member length`.
- The **`split_vis`** folder, which contains all visualizations in Fig. S4 and the data plotted directly in them (`*_source_data.csv` files). Note that the individual subplots have slightly different dimensions than they do in the combined Fig. S4.
    - **`splits/split_vis/combined_plot.png`**: plot displaying the composition of the train, validation, and test splits (Fig. S4).
    - **`splits/split_vis/length_distributions.png`**: plot displaying the length distributions of the train, validation, and test splits (Fig. S4A).
    - **`splits/split_vis/shannon_entropy_plot.png`**: plot displaying the Shannon entropy distributions of the train, validation, and test splits (Fig. S4B).
    - **`splits/split_vis/scatterplot.png`**: plot displaying the cluster size distributions of the train, validation, and test splits (Fig. S4C).
    - **`splits/split_vis/aa_comp.png`**: plot displaying the amino acid composition of the train, validation, and test splits (Fig. S4D). 
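
The `*_cluster_split.csv` files reflect a cluster-level partition: every member of a cluster lands in the same split, so similar sequences do not leak between train, validation, and test. The sketch below illustrates that idea with a simple shuffled, greedy assignment. It is not the procedure in `split.py`; the function name, the split fractions, the seed, and the toy clusters are all assumptions for illustration.

```python
import random


def split_by_cluster(
    clusters: dict[str, list[str]],
    fractions: tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 0,
) -> dict[str, list[str]]:
    """Assign whole clusters to train/val/test so no cluster spans two splits."""
    names = ("train", "val", "test")
    total = sum(len(members) for members in clusters.values())
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)  # reproducible cluster order
    splits = {name: [] for name in names}
    idx = 0
    for rep in reps:
        # Move on to the next split once the current one has reached its share
        while idx < len(names) - 1 and len(splits[names[idx]]) >= fractions[idx] * total:
            idx += 1
        splits[names[idx]].extend(clusters[rep])
    return splits


# Toy clusters keyed by representative ID (hypothetical values)
toy_clusters = {
    "rep_a": ["rep_a", "m1", "m2"],
    "rep_b": ["rep_b", "m3"],
    "rep_c": ["rep_c"],
    "rep_d": ["rep_d", "m4", "m5", "m6"],
}
splits = split_by_cluster(toy_clusters, fractions=(0.6, 0.2, 0.2), seed=0)
```

Because whole clusters are assigned at once, the resulting split sizes only approximate the requested fractions, especially when a few clusters are very large.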

### BLAST
We ran BLAST to get the best alignment of each sequence in FusOn-DB to a protein in SwissProt. See the README in the `blast` folder for more details.