We ran local BLAST to get the best alignment of each fusion oncoprotein sequence to every protein in SwissProt.
### Downloading BLAST Executables and Database
First, we downloaded the BLAST executables and the SwissProt database by entering the following in the terminal (if you don't have a Linux system, find the correct download for your system at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST):
```
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.16.0+-x64-linux.tar.gz
rm ncbi-blast-2.16.0+-x64-linux.tar.gz
cd ncbi-blast-2.16.0+
mkdir swissprot
cd swissprot
perl ../bin/update_blastdb.pl --passive --decompress swissprot
chmod +x "blast/ncbi-blast-2.16.0+/bin/blastp"
sudo chmod -R 755 FusOn-pLM/fuson_plm/data/blast/ncbi-blast-2.16.0+
```
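To illustrate how the downloaded executable and database fit together, here is a minimal sketch of a single `blastp` search launched from Python. The helper names `build_blastp_command` and `run_blastp`, the query and output file names, and the relative paths are assumptions made for illustration; `blast_fusions.py` may invoke BLAST differently. Only the standard `-query`, `-db`, and `-out` options are used.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed layout: these paths mirror the download steps above and are
# relative to the directory where the tarball was extracted.
BLAST_DIR = Path("ncbi-blast-2.16.0+")
BLASTP = BLAST_DIR / "bin" / "blastp"
SWISSPROT_DB = BLAST_DIR / "swissprot" / "swissprot"  # database prefix, not a single file


def build_blastp_command(query_fasta, out_file, blastp=BLASTP, db=SWISSPROT_DB):
    """Return the argument list for one protein-vs-SwissProt search."""
    return [str(blastp), "-query", str(query_fasta), "-db", str(db), "-out", str(out_file)]


def run_blastp(query_fasta, out_file):
    """Run blastp if the executable is usable; raise a clear error otherwise."""
    cmd = build_blastp_command(query_fasta, out_file)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"blastp not found or not executable: {cmd[0]}")
    subprocess.run(cmd, check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(build_blastp_command("seq18.fasta", "seq18_swissprot.out"))
```

Building the command as a list (rather than a shell string) avoids quoting problems with the `+` in the directory name.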
### Running BLAST
The directory is structured as follows:
```
data/
└── blast/
    ├── blast_outputs/
    │   ├── swissprot_blast_output_analyzed.pkl
    │   ├── swissprot_blast_stats.csv
    │   ├── swissprot_no_match.csv
    │   ├── swissprot_no_match.txt
    │   ├── swissprot_top_alignments.csv
    │   ├── best_htg_alignments_swissprot_seqs.pkl
    │   └── ht_uniprot_query.txt
    ├── figures/
    │   ├── identities_hist_source_data.png
    │   └── identities_hist.png
    ├── blast_fusions.py
    ├── extract_blast_seqs.py
    ├── plot.py
    ├── fusion_blast_log.txt
    └── fuson_ht_db.csv
```
- **`blast_fusions.py`**: script that will prepare FusOn-DB for BLAST, run BLAST against SwissProt (given you've installed BLAST software properly), extract top alignments and calculate statistics on the BLAST results, and make results plots. Print statements in this script create the log file `fusion_blast_log.txt`.
- **`extract_blast_seqs.py`**: script that will extract sequences of all the head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file `blast_outputs/best_htg_alignments_swissprot_seqs.pkl`.
- **`plot.py`**: script to make the plot found at `figures/identities_hist.png` (Fig. 1B histogram). The exact data plotted in this histogram is in `figures/identities_hist_source_data`. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also automatically created by `blast_fusions.py`.
- **`fuson_ht_db.csv`**: database that merges FusOn-DB (`/*/FusOn-pLM/fuson_plm/data/fuson_db.csv`) with `/*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv`, which simplifies the process of analyzing BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the `fusiongenes` column. In `fuson_ht_db.csv`, the `fusiongenes` column is exploded such that each row has only one fusion gene. Therefore, this database has more rows than FusOn-DB and contains some duplicate sequences.
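The "exploding" of the `fusiongenes` column described for `fuson_ht_db.csv` can be sketched in plain Python. This is an illustration under assumptions, not the code used in `blast_fusions.py`: the helper `explode_fusiongenes` is hypothetical, and the toy rows use placeholder values rather than real FusOn-DB entries.

```python
def explode_fusiongenes(rows, column="fusiongenes"):
    """Yield one copy of each row per comma-separated name in `column`."""
    for row in rows:
        for name in row[column].split(","):
            exploded = dict(row)  # copy so the input rows stay unchanged
            exploded[column] = name.strip()
            yield exploded


# Placeholder rows (not real data): seqA is shared by two fusion genes.
rows = [
    {"seq_id": "seqA", "fusiongenes": "fusion1,fusion2"},
    {"seq_id": "seqB", "fusiongenes": "fusion3"},
]

exploded_rows = list(explode_fusiongenes(rows))
for row in exploded_rows:
    print(row["seq_id"], row["fusiongenes"])
```

Two input rows become three output rows, and `seqA` now appears twice, which is why the merged database has more rows than FusOn-DB and repeated sequences. With pandas, splitting the column with `str.split(",")` and then calling `DataFrame.explode` gives the same result.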
To run the BLAST search and analysis, we recommend using `nohup`, as the process will take a long time:
```bash
nohup python blast_fusions.py > blastrun.out 2> blastrun.err &
```
### Understanding the output files
Here, we will break down each file in the `blast/blast_outputs` directory.
- **`best_htg_alignments_swissprot_seqs.pkl`**: a dictionary where the keys are UniProt IDs, "."-concatenated to their isoform (e.g. "Q8NFI3.1"), and the values are the amino acid sequences corresponding to those isoforms. The sequences were pulled directly from the SwissProt BLAST database.
- **`ht_uniprot_query.txt`**: a list of all head and tail proteins producing top SwissProt alignments, in the format described above (e.g. "Q8NFI3.1"). Used to query the SwissProt database and create the `best_htg_alignments_swissprot_seqs.pkl` file.
- **`swissprot_blast_output_analyzed.pkl`**: dictionary that summarizes key BLAST results for each fusion protein. The keys are seq_ids, each corresponding to a fusion oncoprotein sequence in FusOn-DB. The values are dictionaries holding BLAST results for that seq_id. Each UniProt ID corresponding to a known head or tail (stored in `fuson_ht_db.csv`) is checked for an alignment. If there is no alignment, the value is None (e.g. `swissprot_blast_output_analyzed['seq18']['F8WED0']` is `None`). If there is an alignment, we store the Isoform, Score, Expect, Query_Aligned, Subject_Aligned, H_or_T (whether this ID is for the head or tail protein), Best (whether this is the best, i.e. highest-scoring, alignment to this fusion oncoprotein), Identities, Positives, Gaps, Query_Start, Query_End, Sbjct_Start, and Sbjct_End. If the best alignment is not to a known head or tail, this alignment is also stored. Below is the example dictionary for seq18.
```python
swissprot_blast_output_analyzed['seq18'] = {
"F8WED0": None,
"Q9Y2X3": {
"Isoform": 1,
"Score": 452.0,
"Expect": "6e-148",
"Query_Aligned": "AGTGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
"Subject_Aligned": "AHAGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
"H_or_T": "Tail",
"Best": False,
"Identities": "228/230 (99%)",
"Positives": "228/230 (99%)",
"Gaps": "0/230 (0%)",
"Query_Start": 754,
"Query_End": 983,
"Sbjct_Start": 300,
"Sbjct_End": 529,
},
"L0R804": None,
"A0A096LP60": None,
"A0A096LNZ0": None,
"H7BZ72": None,
"A0A096LP25": None,
"Q9BUD9": None,
"B7ZLC4": None,
"Q2M2I8": {
"Isoform": 3,
"Score": 1558.0,
"Expect": "0.0",
"Query_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTG",
"Subject_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTA",
"H_or_T": "Head",
"Best": True,
"Identities": "756/757 (99%)",
"Positives": "756/757 (99%)",
"Gaps": "0/757 (0%)",
"Query_Start": 1,
"Query_End": 757,
"Sbjct_Start": 1,
"Sbjct_End": 757,
},
"E9PG46": None,
}
```
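As a sketch of how one entry of this dictionary might be consumed, the snippet below finds the alignment flagged `Best` and converts its `Identities` string into numbers. The helpers `parse_fraction` and `best_alignment` are hypothetical names introduced here, and the entry is a trimmed copy of the seq18 example above (most fields omitted for brevity).

```python
import re


def parse_fraction(field):
    """Parse a BLAST-style string such as '756/757 (99%)' into (756, 757)."""
    match = re.match(r"\s*(\d+)/(\d+)", field)
    if match is None:
        raise ValueError(f"Unexpected format: {field!r}")
    return int(match.group(1)), int(match.group(2))


def best_alignment(entry):
    """Return (uniprot_id, result) for the alignment flagged Best, else None."""
    for uniprot_id, result in entry.items():
        if result is not None and result["Best"]:
            return uniprot_id, result
    return None


# Trimmed copy of the seq18 entry shown above.
seq18 = {
    "F8WED0": None,
    "Q9Y2X3": {"H_or_T": "Tail", "Best": False, "Identities": "228/230 (99%)"},
    "Q2M2I8": {"H_or_T": "Head", "Best": True, "Identities": "756/757 (99%)"},
}

uniprot_id, result = best_alignment(seq18)
n_identical, aligned_length = parse_fraction(result["Identities"])
print(uniprot_id, result["H_or_T"], n_identical, aligned_length)
```

The full dictionary can be loaded with `pickle.load` from `blast_outputs/swissprot_blast_output_analyzed.pkl` and processed entry by entry in the same way.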
- **`swissprot_blast_stats.csv`**: a database summarizing the BLAST scores across all fusion oncoproteins. Columns are: `seq_id`, `hgAlignments`, `tgAlignments`, `totalAlignments`, `best_hgScore`, `best_tgScore`, `best_Score`, `h_or_t_alignment`, `h_and_t_alignment`.
  - `h_or_t_alignment` is True if either the head or tail has an alignment returned by BLAST. `h_and_t_alignment` is True if both the head and tail have an alignment returned by BLAST.
- **`swissprot_no_match.txt`**: names of the BLAST output files that said "No hits found"
- **`swissprot_no_match.csv`**: more information on the fusion oncoproteins indicated in `swissprot_no_match.txt`
- **`swissprot_top_alignments.csv`**: a database summarizing the most important information acquired by BLAST across all fusion oncoproteins. Columns are: `seq_id`, `top_hg_UniProtID`, `top_hg_UniProt_isoform`, `top_hg_UniProt_fus_indices`, `top_tg_UniProtID`, `top_tg_UniProt_isoform`, `top_tg_UniProt_fus_indices`, `top_UniProtID`, `top_UniProt_isoform`, `top_UniProt_fus_indices`, `top_UniProt_nIdentities`, `top_UniProt_nPositives`, `aa_seq_len`.
  - All indices (e.g. `top_hg_UniProt_fus_indices`) are 1-indexed.
  - This database can be used to estimate breakpoints using the `top_hg_UniProt_fus_indices` and `top_tg_UniProt_fus_indices` columns. For example, if `top_hg_UniProt_fus_indices` is "1,300" and `top_tg_UniProt_fus_indices` is "301,546", then residues 1-300 of the fusion protein aligned with the head protein indicated in `top_hg_UniProtID` and `top_hg_UniProt_isoform`, and residues 301-546 of the fusion protein aligned with the tail protein indicated in `top_tg_UniProtID` and `top_tg_UniProt_isoform`. The breakpoint is between residues 300 and 301.
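The breakpoint estimate described above can be sketched as a small helper. This is an illustration, not part of the repository: the function names are hypothetical, and reporting a `gap` value for alignments that overlap or leave unaligned residues is an assumption about how one might handle cases the example does not cover.

```python
def parse_indices(field):
    """Parse a 1-indexed 'start,end' string such as '1,300' into (1, 300)."""
    start, end = (int(part) for part in field.split(","))
    return start, end


def estimate_breakpoint(hg_indices, tg_indices):
    """Estimate the breakpoint from head and tail alignment ranges.

    Returns the last head-aligned residue, the first tail-aligned residue,
    and the number of unaligned residues between them (negative if the two
    alignments overlap). All positions are 1-indexed.
    """
    _, head_end = parse_indices(hg_indices)
    tail_start, _ = parse_indices(tg_indices)
    return {
        "head_end": head_end,
        "tail_start": tail_start,
        "gap": tail_start - head_end - 1,
    }


# Values from the example above: head aligns to 1-300, tail to 301-546.
breakpoint_estimate = estimate_breakpoint("1,300", "301,546")
print(breakpoint_estimate)
```

A `gap` of 0 means the head and tail alignments are directly adjacent, so the breakpoint lies between `head_end` and `tail_start`. A nonzero value signals that the estimate is less precise and may deserve manual inspection.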