We ran local BLAST to get the best alignment of each fusion oncoprotein sequence to every protein in SwissProt.
Downloading BLAST Executables and Database
First, we downloaded the BLAST executables by entering the following in a terminal (if you are not on a Linux system, find the correct download for your system at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST):
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.16.0+-x64-linux.tar.gz
rm ncbi-blast-2.16.0+-x64-linux.tar.gz
cd ncbi-blast-2.16.0+
mkdir swissprot
cd swissprot
perl ../bin/update_blastdb.pl --passive --decompress swissprot
chmod +x "blast/ncbi-blast-2.16.0+/bin/blastp"
sudo chmod -R 755 FusOn-pLM/fuson_plm/data/blast/ncbi-blast-2.16.0+
Running BLAST
The directory is structured as follows:
data/
└── blast/
    ├── blast_outputs/
    │   ├── swissprot_blast_output_analyzed.pkl
    │   ├── swissprot_blast_stats.csv
    │   ├── swissprot_no_match.csv
    │   ├── swissprot_no_match.txt
    │   ├── swissprot_top_alignments.csv
    │   ├── best_htg_alignments_swissprot_seqs.pkl
    │   └── ht_uniprot_query.txt
    ├── figures/
    │   ├── identities_hist_source_data.png
    │   └── identities_hist.png
    ├── blast_fusions.py
    ├── extract_blast_seqs.py
    ├── plot.py
    ├── fusion_blast_log.txt
    └── fuson_ht_db.csv
- blast_fusions.py: script that prepares FusOn-DB for BLAST, runs BLAST against SwissProt (provided the BLAST software is installed properly), extracts top alignments, calculates statistics on the BLAST results, and makes results plots. Print statements in this script create the log file fusion_blast_log.txt.
- extract_blast_seqs.py: script that extracts the sequences of all head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file blast_outputs/best_htg_alignments_swissprot_seqs.pkl.
- plot.py: script that makes the plot found at figures/identities_hist.png (Fig. 1B histogram). The exact data plotted in this histogram is in figures/identities_hist_source_data. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also created automatically by blast_fusions.py.
- fuson_ht_db.csv: database that merges FusOn-DB (/*/FusOn-pLM/fuson_plm/data/fuson_db.csv) with /*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv, which simplifies the process of analyzing BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the fusiongenes column. In fuson_ht_db.csv, the fusiongenes column is exploded such that each row has only one fusion gene. Therefore, this database has more rows than FusOn-DB, and some sequences appear in more than one row.
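The row explosion described for fuson_ht_db.csv can be sketched in plain Python as follows. This is only an illustration, not the code used in blast_fusions.py: the seq_id and aa_seq column names, the toy sequences, and the placeholder fusion gene names are assumptions; only the fusiongenes column name comes from the description above.

```python
def explode_fusiongenes(rows, column="fusiongenes"):
    """Return one row per fusion gene for rows whose `column` is comma-separated."""
    exploded = []
    for row in rows:
        for gene in row[column].split(","):
            new_row = dict(row)          # copy so the input rows are not modified
            new_row[column] = gene.strip()
            exploded.append(new_row)
    return exploded


# Hypothetical toy rows; the names and sequences are placeholders, not FusOn-DB entries.
rows = [
    {"seq_id": "seqA", "aa_seq": "MKKF", "fusiongenes": "HEAD1::TAIL1,HEAD1::TAIL2"},
    {"seq_id": "seqB", "aa_seq": "AGTG", "fusiongenes": "HEAD2::TAIL3"},
]

exploded_rows = explode_fusiongenes(rows)
for row in exploded_rows:
    print(row["seq_id"], row["fusiongenes"])
```

After the explosion, seqA occupies two rows with the same sequence, which is why the exploded table has more rows than the table it was built from.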
To run BLAST search and analysis, we recommend using nohup as the process will take a long time.
nohup python blast_fusions.py > blastrun.out 2> blastrun.err &
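For orientation, a single blastp search of the kind that blast_fusions.py automates might be assembled as below. This is a sketch under stated assumptions, not the script's actual invocation: the query and output file names are hypothetical, and the database path assumes the SwissProt files were downloaded into ncbi-blast-2.16.0+/swissprot as in the setup commands. The -query, -db, and -out options are standard blastp options.

```python
import shlex


def build_blastp_command(query_fasta, db_path, out_path, blastp="blastp"):
    """Build (but do not run) a blastp command as an argument list."""
    return [blastp, "-query", query_fasta, "-db", db_path, "-out", out_path]


cmd = build_blastp_command(
    query_fasta="seq18.fasta",                      # hypothetical query file
    db_path="ncbi-blast-2.16.0+/swissprot/swissprot",  # assumed database location
    out_path="blast_outputs/seq18_results.out",     # hypothetical output file
    blastp="ncbi-blast-2.16.0+/bin/blastp",
)
print(shlex.join(cmd))

# To actually run the search, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems, which matters here because the BLAST directory name contains a "+" and paths may contain spaces.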
Understanding the output files
Here, we will break down each file in the blast/blast_outputs directory.
- best_htg_alignments_swissprot_seqs.pkl: a dictionary where the keys are UniProt IDs, "."-concatenated to their isoform (e.g. "Q8NFI3.1"), and the values are the amino acid sequences corresponding to those isoforms. The sequences were pulled directly from the SwissProt BLAST database.
- ht_uniprot_query.txt: a list of all head and tail proteins producing top SwissProt alignments, in the format described above (e.g. "Q8NFI3.1"). Used to query the SwissProt database and create the best_htg_alignments_swissprot_seqs.pkl file.
- swissprot_blast_output_analyzed.pkl: a dictionary that summarizes key BLAST results for each fusion protein. The keys are seq_ids, each corresponding to a fusion oncoprotein sequence in FusOn-DB. The values are dictionaries holding BLAST results for that seq_id. Each UniProt ID corresponding to a known head or tail (stored in fuson_ht_db.csv) is checked for an alignment. If there is no alignment, the value is None (e.g. swissprot_blast_output_analyzed['seq18']['F8WED0'] is None). If there is an alignment, we store the Isoform, Score, Expect, Query_Aligned, Subject_Aligned, H_or_T (whether this ID is for the head or tail protein), Best (whether this is the best, i.e. highest-scoring, alignment to this fusion oncoprotein), Identities, Positives, Gaps, Query_Start, Query_End, Sbjct_Start, and Sbjct_End. If the best alignment is not a known head or tail, this alignment is also stored. Below is the example dictionary for seq18.
swissprot_blast_output_analyzed['seq18'] =
{
"F8WED0": None,
"Q9Y2X3": {
"Isoform": 1,
"Score": 452.0,
"Expect": "6e-148",
"Query_Aligned": "AGTGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
"Subject_Aligned": "AHAGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
"H_or_T": "Tail",
"Best": False,
"Identities": "228/230 (99%)",
"Positives": "228/230 (99%)",
"Gaps": "0/230 (0%)",
"Query_Start": 754,
"Query_End": 983,
"Sbjct_Start": 300,
"Sbjct_End": 529,
},
"L0R804": None,
"A0A096LP60": None,
"A0A096LNZ0": None,
"H7BZ72": None,
"A0A096LP25": None,
"Q9BUD9": None,
"B7ZLC4": None,
"Q2M2I8": {
"Isoform": 3,
"Score": 1558.0,
"Expect": "0.0",
"Query_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTG",
"Subject_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTA",
"H_or_T": "Head",
"Best": True,
"Identities": "756/757 (99%)",
"Positives": "756/757 (99%)",
"Gaps": "0/757 (0%)",
"Query_Start": 1,
"Query_End": 757,
"Sbjct_Start": 1,
"Sbjct_End": 757,
},
"E9PG46": None,
}
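A minimal sketch of how one entry of this dictionary could be queried is shown below. The helper names parse_fraction and best_alignment are hypothetical, and the seq18 entry is trimmed to the fields used here; in practice the full dictionary would be loaded from swissprot_blast_output_analyzed.pkl with the pickle module.

```python
import re


def parse_fraction(field):
    """Parse a BLAST-style 'n/m (p%)' string into the integers (n, m)."""
    match = re.match(r"\s*(\d+)/(\d+)", field)
    if match is None:
        raise ValueError(f"unexpected format: {field!r}")
    return int(match.group(1)), int(match.group(2))


def best_alignment(entry):
    """Return (uniprot_id, result) for the alignment flagged Best, or None."""
    for uniprot_id, result in entry.items():
        if result is not None and result["Best"]:
            return uniprot_id, result
    return None


# Trimmed copy of the seq18 entry shown above (long fields omitted).
seq18 = {
    "F8WED0": None,
    "Q9Y2X3": {"H_or_T": "Tail", "Best": False, "Identities": "228/230 (99%)"},
    "Q2M2I8": {"H_or_T": "Head", "Best": True, "Identities": "756/757 (99%)"},
}

best_id, best = best_alignment(seq18)
n_identical, aligned_len = parse_fraction(best["Identities"])
print(best_id, best["H_or_T"], n_identical, aligned_len)
```

For seq18 this selects the head protein Q2M2I8, with 756 identical residues over an aligned length of 757.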
- swissprot_blast_stats.csv: a database summarizing the BLAST scores across all fusion oncoproteins. Columns are: seq_id, hgAlignments, tgAlignments, totalAlignments, best_hgScore, best_tgScore, best_Score, h_or_t_alignment, h_and_t_alignment
  - h_or_t_alignment is True if either the head or tail has an alignment returned by BLAST.
  - h_and_t_alignment is True if both the head and tail have an alignment returned by BLAST.
- swissprot_no_match.txt: names of the BLAST output files that reported "No hits found".
- swissprot_no_match.csv: more information on the fusion oncoproteins listed in swissprot_no_match.txt.
- swissprot_top_alignments.csv: a database summarizing the most important information acquired by BLAST across all fusion oncoproteins. Columns are: seq_id, top_hg_UniProtID, top_hg_UniProt_isoform, top_hg_UniProt_fus_indices, top_tg_UniProtID, top_tg_UniProt_isoform, top_tg_UniProt_fus_indices, top_UniProtID, top_UniProt_isoform, top_UniProt_fus_indices, top_UniProt_nIdentities, top_UniProt_nPositives, aa_seq_len
  - All indices (e.g. top_hg_UniProt_fus_indices) are 1-indexed.
  - This database can be used to estimate breakpoints using the top_hg_UniProt_fus_indices and top_tg_UniProt_fus_indices columns. For example, if top_hg_UniProt_fus_indices is "1,300" and top_tg_UniProt_fus_indices is "301,546", then residues 1-300 of the fusion protein aligned with the head protein indicated in top_hg_UniProtID and top_hg_UniProt_isoform, and residues 301-546 of the fusion protein aligned with the tail protein indicated in top_tg_UniProtID and top_tg_UniProt_isoform. The breakpoint is between residues 300 and 301.
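The breakpoint estimate from the example can be sketched as follows. The function names are hypothetical, and the sketch assumes each indices field is a "start,end" string of 1-indexed positions, as in the example.

```python
def parse_indices(field):
    """Parse a 'start,end' string of 1-indexed positions into two integers."""
    start, end = (int(part) for part in field.split(","))
    return start, end


def estimate_breakpoint(hg_indices, tg_indices):
    """Return (last head-aligned residue, first tail-aligned residue)."""
    _, head_end = parse_indices(hg_indices)
    tail_start, _ = parse_indices(tg_indices)
    return head_end, tail_start


head_end, tail_start = estimate_breakpoint("1,300", "301,546")
print(f"breakpoint between residues {head_end} and {tail_start}")
```

When the head and tail alignments overlap or leave unaligned residues between them, the two values are not consecutive, and the pair is better read as a range that contains the breakpoint than as an exact junction.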