We ran local BLAST to get the best alignment of each fusion oncoprotein sequence to every protein in SwissProt.

Downloading BLAST Executables and Database

First, we downloaded the BLAST executables by entering the following in the terminal (if you don't have a Linux system, find the correct download for your system at https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST):

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.16.0+-x64-linux.tar.gz
rm ncbi-blast-2.16.0+-x64-linux.tar.gz

cd ncbi-blast-2.16.0+
mkdir swissprot
cd swissprot
perl ../bin/update_blastdb.pl --passive --decompress swissprot

chmod +x "blast/ncbi-blast-2.16.0+/bin/blastp"
sudo chmod -R 755 FusOn-pLM/fuson_plm/data/blast/ncbi-blast-2.16.0+

Running BLAST

The directory is structured as follows:

data/
└── blast/
    ├── blast_outputs/
    │   ├── swissprot_blast_output_analyzed.pkl
    │   ├── swissprot_blast_stats.csv
    │   ├── swissprot_no_match.csv
    │   ├── swissprot_no_match.txt
    │   ├── swissprot_top_alignments.csv
    │   ├── best_htg_alignments_swissprot_seqs.pkl
    │   └── ht_uniprot_query.txt
    ├── figures/
    │   ├── identities_hist_source_data.png
    │   └── identities_hist.png
    ├── blast_fusions.py
    ├── extract_blast_seqs.py
    ├── plot.py
    ├── fusion_blast_log.txt
    └── fuson_ht_db.csv

  • blast_fusions.py: script that will prepare FusOn-DB for BLAST, run BLAST against SwissProt (given you've installed BLAST software properly), extract top alignments and calculate statistics on the BLAST results, and make results plots. Print statements in this script create the log file fusion_blast_log.txt.
  • extract_blast_seqs.py: script that will extract sequences of all the head/tail proteins that formed the best alignment during BLAST, directly from the SwissProt BLAST database. Creates the file blast_outputs/best_htg_alignments_swissprot_seqs.pkl.
  • plot.py: script to make the plot found at figures/identities_hist.png (Fig. 1B histogram). The exact data plotted in this histogram is in figures/identities_hist_source_data. This plot displays the maximum % identity of each fusion oncoprotein sequence with a SwissProt sequence, based on BLAST. This plot is also automatically created by blast_fusions.py.
  • fuson_ht_db.csv: database that merges FusOn-DB (/*/FusOn-pLM/fuson_plm/data/fuson_db.csv) with /*/FusOn-pLM/fuson_plm/data/head_tail_data/htgenes_uniprotids.csv, which simplifies the analysis of BLAST results. In FusOn-DB, certain amino acid sequences are associated with multiple fusion oncoproteins, whose names are comma-separated in the fusiongenes column. In fuson_ht_db.csv, the fusiongenes column is exploded so that each row has only one fusion gene. This database therefore has more rows than FusOn-DB, and some sequences appear more than once.
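
The "explode" step described for fuson_ht_db.csv can be illustrated with a small standard-library sketch. The fusiongenes column name comes from the description above; the helper name explode_fusiongenes, the seq_id values, and the gene names are made up for illustration, and this is not necessarily how blast_fusions.py implements the step.

```python
def explode_fusiongenes(rows, column="fusiongenes", sep=","):
    """Return one output row per separator-delimited value in `column`."""
    exploded = []
    for row in rows:
        for name in row[column].split(sep):
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[column] = name.strip()
            exploded.append(new_row)
    return exploded


# Made-up rows for illustration only; real rows would come from fuson_db.csv.
rows = [
    {"seq_id": "seqA", "fusiongenes": "GENE1::GENE2,GENE3::GENE2"},
    {"seq_id": "seqB", "fusiongenes": "GENE4::GENE5"},
]
exploded = explode_fusiongenes(rows)
```

Here the first input row yields two output rows that share the same seq_id, which is why the exploded database has more rows than FusOn-DB and contains repeated sequences.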

To run the BLAST search and analysis, we recommend using nohup, as the process takes a long time.

nohup python blast_fusions.py > blastrun.out 2> blastrun.err &
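
For orientation, a single protein query against the downloaded SwissProt database might be launched from Python roughly as follows. This is a sketch under stated assumptions, not the code in blast_fusions.py: the helper name build_blastp_command, the file names seq18.fasta and seq18_blast.out, and the relative directory layout are all hypothetical. Only the standard blastp options -query, -db, and -out are used.

```python
from pathlib import Path


def build_blastp_command(query_fasta, out_path,
                         blast_dir="ncbi-blast-2.16.0+", db_name="swissprot"):
    """Build (but do not run) a blastp command as an argument list."""
    blast_dir = Path(blast_dir)
    return [
        str(blast_dir / "bin" / "blastp"),
        "-query", str(query_fasta),
        # Assumes the database files live in <blast_dir>/swissprot/,
        # as created by the download steps above.
        "-db", str(blast_dir / db_name / db_name),
        "-out", str(out_path),
    ]


# Hypothetical input and output file names.
cmd = build_blastp_command("seq18.fasta", "seq18_blast.out")
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list (rather than a single shell string) avoids quoting problems with the "+" in the directory name.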

Understanding the output files

Here, we will break down each file in the blast/blast_outputs directory.

  • best_htg_alignments_swissprot_seqs.pkl: a dictionary where the keys are UniProt IDs, "."-concatenated to their isoform (e.g. "Q8NFI3.1"), and the values are the amino acid sequences corresponding to those isoforms. The sequences were pulled directly from the SwissProt BLAST database.
  • ht_uniprot_query.txt: a list of all head and tail proteins producing top SwissProt alignments, in the format described above (e.g. "Q8NFI3.1"). Used to query the SwissProt database and create the best_htg_alignments_swissprot_seqs.pkl file.
  • swissprot_blast_output_analyzed.pkl: dictionary that summarizes key BLAST results for each fusion protein. The keys are seq_ids, each corresponding to a fusion oncoprotein sequence in FusOn-DB. The values are dictionaries holding BLAST results for that seq_id. Each UniProt ID corresponding to a known head or tail (stored in fuson_ht_db.csv) is checked for an alignment. If there is no alignment, the value is None (e.g. swissprot_blast_output_analyzed['seq18']['F8WED0'] is None). If there is an alignment, we store the Isoform, Score, Expect, Query_Aligned, Subject_Aligned, H_or_T (whether this ID is for the head or tail protein), Best (whether this is the best, i.e. highest-scoring, alignment to this fusion oncoprotein), Identities, Positives, Gaps, Query_Start, Query_End, Sbjct_Start, and Sbjct_End. If the best alignment is not to a known head or tail, this alignment is also stored. Below is the example dictionary for seq18.
swissprot_blast_output_analyzed['seq18'] = 
{
    "F8WED0": None,
    "Q9Y2X3": {
        "Isoform": 1,
        "Score": 452.0,
        "Expect": "6e-148",
        "Query_Aligned": "AGTGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
        "Subject_Aligned": "AHAGSLLNLAKHAASTVQILGAEKALFRALKSRRDTPKYGLIYHASLVGQTSPKHKGKISRMLAAKTVLAIRYDAFGEDSSSAMGVENRAKLEARLRTLEDRGIRKISGTGKALAKTEKYEHKSEVKTYDPSGDSTLPTCSKKRKIEQVDKEDEITEKKAKKAKIKVKVEEEEEEKVAEEEETSVKKKKKRGKKKHIKEEPLSEEEPCTSTAIASPEKKKKKKKKRENED",
        "H_or_T": "Tail",
        "Best": False,
        "Identities": "228/230 (99%)",
        "Positives": "228/230 (99%)",
        "Gaps": "0/230 (0%)",
        "Query_Start": 754,
        "Query_End": 983,
        "Sbjct_Start": 300,
        "Sbjct_End": 529,
    },
    "L0R804": None,
    "A0A096LP60": None,
    "A0A096LNZ0": None,
    "H7BZ72": None,
    "A0A096LP25": None,
    "Q9BUD9": None,
    "B7ZLC4": None,
    "Q2M2I8": {
        "Isoform": 3,
        "Score": 1558.0,
        "Expect": "0.0",
        "Query_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTG",
        "Subject_Aligned": "MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTA",
        "H_or_T": "Head",
        "Best": True,
        "Identities": "756/757 (99%)",
        "Positives": "756/757 (99%)",
        "Gaps": "0/757 (0%)",
        "Query_Start": 1,
        "Query_End": 757,
        "Sbjct_Start": 1,
        "Sbjct_End": 757,
    },
    "E9PG46": None,
}
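
As a minimal sketch of how an entry like the one above could be inspected, the snippet below finds the alignment flagged as Best and parses its Identities string into integers. The helper names parse_ratio and best_alignment are hypothetical, and the seq18 dictionary is a trimmed copy of the example with the aligned sequences and several fields omitted.

```python
import re


def parse_ratio(text):
    """Parse a 'matches/length (pct%)' string into two integers."""
    match = re.match(r"\s*(\d+)/(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    return int(match.group(1)), int(match.group(2))


def best_alignment(results):
    """Return (uniprot_id, alignment) for the entry flagged Best, or None."""
    for uniprot_id, alignment in results.items():
        # IDs without an alignment are stored as None and are skipped.
        if alignment is not None and alignment["Best"]:
            return uniprot_id, alignment
    return None


# Trimmed copy of the seq18 entry shown above.
seq18 = {
    "F8WED0": None,
    "Q9Y2X3": {"Isoform": 1, "Score": 452.0, "H_or_T": "Tail", "Best": False,
               "Identities": "228/230 (99%)"},
    "Q2M2I8": {"Isoform": 3, "Score": 1558.0, "H_or_T": "Head", "Best": True,
               "Identities": "756/757 (99%)"},
}

best_id, best = best_alignment(seq18)
n_identical, aligned_len = parse_ratio(best["Identities"])
print(best_id, best["H_or_T"], n_identical, aligned_len)  # Q2M2I8 Head 756 757
```

With the real file, the full dictionary would presumably first be loaded with Python's pickle module, and best_alignment would be applied to each seq_id in turn.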
  • swissprot_blast_stats.csv: a database summarizing the BLAST scores across all fusion oncoproteins. Columns are: seq_id,hgAlignments,tgAlignments,totalAlignments,best_hgScore,best_tgScore,best_Score,h_or_t_alignment,h_and_t_alignment
    • h_or_t_alignment is True if either the head or tail has an alignment returned by BLAST. h_and_t_alignment is True if both the head and tail have an alignment returned by BLAST.
  • swissprot_no_match.txt: names of the BLAST output files that said "No hits found"
  • swissprot_no_match.csv: more information on the fusion oncoproteins indicated in swissprot_no_match.txt
  • swissprot_top_alignments.csv: a database summarizing the most important information acquired by BLAST across all fusion oncoproteins. Columns are: seq_id,top_hg_UniProtID,top_hg_UniProt_isoform,top_hg_UniProt_fus_indices,top_tg_UniProtID,top_tg_UniProt_isoform,top_tg_UniProt_fus_indices,top_UniProtID,top_UniProt_isoform,top_UniProt_fus_indices,top_UniProt_nIdentities,top_UniProt_nPositives,aa_seq_len
    • All indices (e.g. top_hg_UniProt_fus_indices) are 1-indexed.
    • This database can be used to estimate breakpoints using the top_hg_UniProt_fus_indices and top_tg_UniProt_fus_indices columns. For example, if top_hg_UniProt_fus_indices is "1,300" and top_tg_UniProt_fus_indices is "301,546", then residues 1-300 of the fusion protein aligned with the head protein indicated in top_hg_UniProtID and top_hg_UniProt_isoform, and residues 301-546 of the fusion protein aligned with the tail protein indicated in top_tg_UniProtID and top_tg_UniProt_isoform. The breakpoint is between residues 300 and 301.
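
The breakpoint estimate described above can be sketched in a few lines of Python. The function names parse_indices, estimate_breakpoint, and split_at_breakpoint are hypothetical, and the toy sequence is made up. The only facts taken from the text are that the index strings have the form "start,end" and are 1-indexed.

```python
def parse_indices(text):
    """Parse a 1-indexed 'start,end' string such as '1,300' into two ints."""
    start, end = (int(part) for part in text.split(","))
    return start, end


def estimate_breakpoint(hg_indices, tg_indices):
    """Return (last head-aligned residue, first tail-aligned residue)."""
    _, head_end = parse_indices(hg_indices)
    tail_start, _ = parse_indices(tg_indices)
    return head_end, tail_start


def split_at_breakpoint(sequence, hg_indices, tg_indices):
    """Slice the fusion sequence into its head- and tail-aligned parts."""
    h_start, h_end = parse_indices(hg_indices)
    t_start, t_end = parse_indices(tg_indices)
    # Convert 1-indexed inclusive ranges to Python's 0-indexed slices.
    return sequence[h_start - 1:h_end], sequence[t_start - 1:t_end]


# Values from the example in the text, with a made-up 546-residue sequence.
breakpoint_residues = estimate_breakpoint("1,300", "301,546")
toy_sequence = "A" * 300 + "C" * 246
head_part, tail_part = split_at_breakpoint(toy_sequence, "1,300", "301,546")
print(breakpoint_residues, len(head_part), len(tail_part))  # (300, 301) 300 246
```

In real rows the head and tail alignments may overlap or leave a gap between them, in which case the two returned residues are not adjacent and the breakpoint is only bounded rather than pinned down.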