Spaces:
No application file
No application file
| # Copyright 2012 by Wibowo Arindrarto. All rights reserved. | |
| # This file is part of the Biopython distribution and governed by your | |
| # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". | |
| # Please see the LICENSE file that should have been included as part of this | |
| # package. | |
| """Biopython interface for sequence search program outputs. | |
| The SearchIO submodule provides parsers, indexers, and writers for outputs from | |
| various sequence search programs. It provides an API similar to SeqIO and | |
| AlignIO, with the following main functions: ``parse``, ``read``, ``to_dict``, | |
| ``index``, ``index_db``, ``write``, and ``convert``. | |
| SearchIO parses a search output file's contents into a hierarchy of four nested | |
| objects: QueryResult, Hit, HSP, and HSPFragment. Each of them models a part of | |
| the search output file: | |
| - QueryResult represents a search query. This is the main object returned | |
| by the input functions and it contains all other objects. | |
| - Hit represents a database hit, | |
| - HSP represents high-scoring alignment region(s) in the hit, | |
| - HSPFragment represents a contiguous alignment within the HSP | |
| In addition to the four objects above, SearchIO is also tightly integrated with | |
| the SeqRecord objects (see SeqIO) and MultipleSeqAlignment objects (see | |
| AlignIO). SeqRecord objects are used to store the actual matching hit and query | |
| sequences, while MultipleSeqAlignment objects store the alignment between them. | |
| A detailed description of these objects' features and their example usages are | |
| available in their respective documentations. | |
| Input | |
| ===== | |
| The main function for parsing search output files is Bio.SearchIO.parse(...). | |
| This function parses a given search output file and returns a generator object | |
| that yields one QueryResult object per iteration. | |
| ``parse`` takes two arguments: 1) a file handle or a filename of the input file | |
| (the search output file) and 2) the format name. | |
| >>> from Bio import SearchIO | |
| >>> for qresult in SearchIO.parse('Blast/mirna.xml', 'blast-xml'): | |
| ... print("%s %s" % (qresult.id, qresult.description)) | |
| ... | |
| 33211 mir_1 | |
| 33212 mir_2 | |
| 33213 mir_3 | |
| SearchIO also provides the Bio.SearchIO.read(...) function, which is intended | |
| for use on search output files containing only one query. ``read`` returns one | |
| QueryResult object and will raise an exception if the source file contains more | |
| than one query: | |
| >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml') | |
| >>> print("%s %s" % (qresult.id, qresult.description)) | |
| ... | |
| gi|11464971:4-101 pleckstrin [Mus musculus] | |
| >>> SearchIO.read('Blast/mirna.xml', 'blast-xml') | |
| Traceback (most recent call last): | |
| ... | |
| ValueError: ... | |
| For accessing search results of large output files, you may use the indexing | |
| functions Bio.SearchIO.index(...) or Bio.SearchIO.index_db(...). They have a | |
| similar interface to their counterparts in SeqIO and AlignIO, with the addition | |
| of optional, format-specific keyword arguments. | |
| Output | |
| ====== | |
| SearchIO has write support for several formats, accessible from the | |
| Bio.SearchIO.write(...) function. This function returns a tuple of four | |
| numbers: the number of QueryResult, Hit, HSP, and HSPFragment written:: | |
| qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml') | |
| SearchIO.write(qresults, 'results.tab', 'blast-tab') | |
| <stdout> (3, 239, 277, 277) | |
| Note that different writers may require different attribute values of the | |
| SearchIO objects. This limits the scope of writable search results to search | |
| results possessing the required attributes. | |
| For example, the writer for HMMER domain table output requires | |
| the conditional e-value attribute from each HSP object, among others. If you | |
| try to write to the HMMER domain table format and your HSPs do not have this | |
| attribute, an exception will be raised. | |
| Conversion | |
| ========== | |
| SearchIO provides a shortcut function Bio.SearchIO.convert(...) to convert a | |
| given file into another format. Under the hood, ``convert`` simply parses a given | |
| output file and writes it to another using the ``parse`` and ``write`` functions. | |
| Note that the same restrictions found in Bio.SearchIO.write(...) apply to the | |
| convert function as well. | |
| Conventions | |
| =========== | |
| The main goal of creating SearchIO is to have a common, easy to use interface | |
| across different search output files. As such, we have also created some | |
| conventions / standards for SearchIO that extend beyond the common object model. | |
| These conventions apply to all files parsed by SearchIO, regardless of their | |
| individual formats. | |
| Python-style sequence coordinates | |
| --------------------------------- | |
| When storing sequence coordinates (start and end values), SearchIO uses | |
| the Python-style slice convention: zero-based and half-open intervals. For | |
| example, if in a BLAST XML output file the start and end coordinates of an | |
| HSP are 10 and 28, they would become 9 and 28 in SearchIO. The start | |
| coordinate becomes 9 because Python indices start from zero, while the end | |
| coordinate remains 28 as Python slices omit the last item in an interval. | |
| Beside giving you the benefits of standardization, this convention also | |
| makes the coordinates usable for slicing sequences. For example, given a | |
| full query sequence and the start and end coordinates of an HSP, one can | |
| use the coordinates to extract part of the query sequence that results in | |
| the database hit. | |
| When these objects are written to an output file using | |
| SearchIO.write(...), the coordinate values are restored to their | |
| respective format's convention. Using the example above, if the HSP would | |
| be written to an XML file, the start and end coordinates would become 10 | |
| and 28 again. | |
| Sequence coordinate order | |
| ------------------------- | |
| Some search output formats reverse the start and end coordinate sequences | |
| according to the sequence's strand. For example, in BLAST plain text | |
| format if the matching strand lies in the minus orientation, then the | |
| start coordinate will always be bigger than the end coordinate. | |
| In SearchIO, start coordinates are always smaller than the end | |
| coordinates, regardless of their originating strand. This ensures | |
| consistency when using the coordinates to slice full sequences. | |
| Note that this coordinate order convention is only enforced in the | |
| HSPFragment level. If an HSP object has several HSPFragment objects, each | |
| individual fragment will conform to this convention. But the order of the | |
| fragments within the HSP object follows what the search output file uses. | |
| Similar to the coordinate style convention, the start and end coordinates' | |
| order are restored to their respective formats when the objects are | |
| written using Bio.SearchIO.write(...). | |
| Frames and strand values | |
| ------------------------ | |
| SearchIO only allows -1, 0, 1 and None as strand values. For frames, the | |
| only allowed values are integers from -3 to 3 (inclusive) and None. Both | |
| of these are standard Biopython conventions. | |
| Supported Formats | |
| ================= | |
| Below is a list of search program output formats supported by SearchIO. | |
| Support for parsing, indexing, and writing: | |
| - blast-tab - BLAST+ tabular output. Both variants without comments | |
| (-m 6 flag) and with comments (-m 7 flag) are supported. | |
| - blast-xml - BLAST+ XML output. | |
| - blat-psl - The default output of BLAT (PSL format). Variants with or | |
| without header are both supported. PSLX (PSL + sequences) | |
| is also supported. | |
| - hmmer3-tab - HMMER3 table output. | |
| - hmmer3-domtab - HMMER3 domain table output. When using this format, the | |
| program name has to be specified. For example, for parsing | |
| hmmscan output, the name would be 'hmmscan-domtab'. | |
| Support for parsing and indexing: | |
| - exonerate-text - Exonerate plain text output. | |
| - exonerate-vulgar - Exonerate vulgar line. | |
| - exonerate-cigar - Exonerate cigar line. | |
| - fasta-m10 - Bill Pearson's FASTA -m 10 output. | |
| - hmmer3-text - HMMER3 regular text output format. Supported HMMER3 | |
| subprograms are hmmscan, hmmsearch, and phmmer. | |
| - hmmer2-text - HMMER2 regular text output format. Supported HMMER2 | |
| subprograms are hmmpfam, hmmsearch. | |
| Support for parsing: | |
| - blast-text - BLAST+ plain text output. | |
| - hhsuite2-text - HHSUITE plain text output. | |
| Each of these formats have different keyword arguments available for use with | |
| the main SearchIO functions. More details and examples are available in each | |
| of the format's documentation. | |
| """ | |
| from Bio.File import as_handle | |
| from Bio.SearchIO._model import QueryResult, Hit, HSP, HSPFragment | |
| from Bio.SearchIO._utils import get_processor | |
| __all__ = ("read", "parse", "to_dict", "index", "index_db", "write", "convert") | |
| # dictionary of supported formats for parse() and read() | |
| _ITERATOR_MAP = { | |
| "blast-tab": ("BlastIO", "BlastTabParser"), | |
| "blast-text": ("BlastIO", "BlastTextParser"), | |
| "blast-xml": ("BlastIO", "BlastXmlParser"), | |
| "blat-psl": ("BlatIO", "BlatPslParser"), | |
| "exonerate-cigar": ("ExonerateIO", "ExonerateCigarParser"), | |
| "exonerate-text": ("ExonerateIO", "ExonerateTextParser"), | |
| "exonerate-vulgar": ("ExonerateIO", "ExonerateVulgarParser"), | |
| "fasta-m10": ("FastaIO", "FastaM10Parser"), | |
| "hhsuite2-text": ("HHsuiteIO", "Hhsuite2TextParser"), | |
| "hhsuite3-text": ("HHsuiteIO", "Hhsuite2TextParser"), | |
| "hmmer2-text": ("HmmerIO", "Hmmer2TextParser"), | |
| "hmmer3-text": ("HmmerIO", "Hmmer3TextParser"), | |
| "hmmer3-tab": ("HmmerIO", "Hmmer3TabParser"), | |
| # for hmmer3-domtab, the specific program is part of the format name | |
| # as we need it distinguish hit / target coordinates | |
| "hmmscan3-domtab": ("HmmerIO", "Hmmer3DomtabHmmhitParser"), | |
| "hmmsearch3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryParser"), | |
| "interproscan-xml": ("InterproscanIO", "InterproscanXmlParser"), | |
| "phmmer3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryParser"), | |
| } | |
| # dictionary of supported formats for index() | |
| _INDEXER_MAP = { | |
| "blast-tab": ("BlastIO", "BlastTabIndexer"), | |
| "blast-xml": ("BlastIO", "BlastXmlIndexer"), | |
| "blat-psl": ("BlatIO", "BlatPslIndexer"), | |
| "exonerate-cigar": ("ExonerateIO", "ExonerateCigarIndexer"), | |
| "exonerate-text": ("ExonerateIO", "ExonerateTextIndexer"), | |
| "exonerate-vulgar": ("ExonerateIO", "ExonerateVulgarIndexer"), | |
| "fasta-m10": ("FastaIO", "FastaM10Indexer"), | |
| "hmmer2-text": ("HmmerIO", "Hmmer2TextIndexer"), | |
| "hmmer3-text": ("HmmerIO", "Hmmer3TextIndexer"), | |
| "hmmer3-tab": ("HmmerIO", "Hmmer3TabIndexer"), | |
| "hmmscan3-domtab": ("HmmerIO", "Hmmer3DomtabHmmhitIndexer"), | |
| "hmmsearch3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryIndexer"), | |
| "phmmer3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryIndexer"), | |
| } | |
| # dictionary of supported formats for write() | |
| _WRITER_MAP = { | |
| "blast-tab": ("BlastIO", "BlastTabWriter"), | |
| "blast-xml": ("BlastIO", "BlastXmlWriter"), | |
| "blat-psl": ("BlatIO", "BlatPslWriter"), | |
| "hmmer3-tab": ("HmmerIO", "Hmmer3TabWriter"), | |
| "hmmscan3-domtab": ("HmmerIO", "Hmmer3DomtabHmmhitWriter"), | |
| "hmmsearch3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryWriter"), | |
| "phmmer3-domtab": ("HmmerIO", "Hmmer3DomtabHmmqueryWriter"), | |
| } | |
| def parse(handle, format=None, **kwargs): | |
| """Iterate over search tool output file as QueryResult objects. | |
| Arguments: | |
| - handle - Handle to the file, or the filename as a string. | |
| - format - Lower case string denoting one of the supported formats. | |
| - kwargs - Format-specific keyword arguments. | |
| This function is used to iterate over each query in a given search output | |
| file: | |
| >>> from Bio import SearchIO | |
| >>> qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml') | |
| >>> qresults | |
| <generator object ...> | |
| >>> for qresult in qresults: | |
| ... print("Search %s has %i hits" % (qresult.id, len(qresult))) | |
| ... | |
| Search 33211 has 100 hits | |
| Search 33212 has 44 hits | |
| Search 33213 has 95 hits | |
| Depending on the file format, ``parse`` may also accept additional keyword | |
| argument(s) that modifies the behavior of the format parser. Here is a | |
| simple example, where the keyword argument enables parsing of a commented | |
| BLAST tabular output file: | |
| >>> from Bio import SearchIO | |
| >>> for qresult in SearchIO.parse('Blast/mirna.tab', 'blast-tab', comments=True): | |
| ... print("Search %s has %i hits" % (qresult.id, len(qresult))) | |
| ... | |
| Search 33211 has 100 hits | |
| Search 33212 has 44 hits | |
| Search 33213 has 95 hits | |
| """ | |
| # get the iterator object and do error checking | |
| iterator = get_processor(format, _ITERATOR_MAP) | |
| # HACK: force BLAST XML decoding to use utf-8 | |
| handle_kwargs = {} | |
| if format == "blast-xml": | |
| handle_kwargs["encoding"] = "utf-8" | |
| # and start iterating | |
| with as_handle(handle, **handle_kwargs) as source_file: | |
| generator = iterator(source_file, **kwargs) | |
| yield from generator | |
| def read(handle, format=None, **kwargs): | |
| """Turn a search output file containing one query into a single QueryResult. | |
| - handle - Handle to the file, or the filename as a string. | |
| - format - Lower case string denoting one of the supported formats. | |
| - kwargs - Format-specific keyword arguments. | |
| ``read`` is used for parsing search output files containing exactly one query: | |
| >>> from Bio import SearchIO | |
| >>> qresult = SearchIO.read('Blast/xml_2226_blastp_004.xml', 'blast-xml') | |
| >>> print("%s %s" % (qresult.id, qresult.description)) | |
| ... | |
| gi|11464971:4-101 pleckstrin [Mus musculus] | |
| If the given handle has no results, an exception will be raised: | |
| >>> from Bio import SearchIO | |
| >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_002.txt', 'blast-tab') | |
| Traceback (most recent call last): | |
| ... | |
| ValueError: No query results found in handle | |
| Similarly, if the given handle has more than one result, an exception will | |
| be raised: | |
| >>> from Bio import SearchIO | |
| >>> qresult = SearchIO.read('Blast/tab_2226_tblastn_001.txt', 'blast-tab') | |
| Traceback (most recent call last): | |
| ... | |
| ValueError: More than one query result found in handle | |
| Like ``parse``, ``read`` may also accept keyword argument(s) depending on the | |
| search output file format. | |
| """ | |
| query_results = parse(handle, format, **kwargs) | |
| try: | |
| query_result = next(query_results) | |
| except StopIteration: | |
| raise ValueError("No query results found in handle") from None | |
| try: | |
| next(query_results) | |
| raise ValueError("More than one query result found in handle") | |
| except StopIteration: | |
| pass | |
| return query_result | |
| def to_dict(qresults, key_function=None): | |
| """Turn a QueryResult iterator or list into a dictionary. | |
| - qresults - Iterable returning QueryResult objects. | |
| - key_function - Optional callback function which when given a | |
| QueryResult object should return a unique key for the | |
| dictionary. Defaults to using .id of the result. | |
| This function enables access of QueryResult objects from a single search | |
| output file using its identifier. | |
| >>> from Bio import SearchIO | |
| >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml') | |
| >>> search_dict = SearchIO.to_dict(qresults) | |
| >>> list(search_dict) | |
| ['gi|195230749:301-1383', 'gi|325053704:108-1166', ..., 'gi|53729353:216-1313'] | |
| >>> search_dict['gi|156630997:105-1160'] | |
| QueryResult(id='gi|156630997:105-1160', 5 hits) | |
| By default, the dictionary key is the QueryResult's string ID. This may be | |
| changed by supplying a callback function that returns the desired identifier. | |
| Here is an example using a function that removes the 'gi|' part in the | |
| beginning of the QueryResult ID. | |
| >>> from Bio import SearchIO | |
| >>> qresults = SearchIO.parse('Blast/wnts.xml', 'blast-xml') | |
| >>> key_func = lambda qresult: qresult.id.split('|')[1] | |
| >>> search_dict = SearchIO.to_dict(qresults, key_func) | |
| >>> list(search_dict) | |
| ['195230749:301-1383', '325053704:108-1166', ..., '53729353:216-1313'] | |
| >>> search_dict['156630997:105-1160'] | |
| QueryResult(id='gi|156630997:105-1160', 5 hits) | |
| Note that the callback function does not change the QueryResult's ID value. | |
| It only changes the key value used to retrieve the associated QueryResult. | |
| As this function loads all QueryResult objects into memory, it may be | |
| unsuitable for dealing with files containing many queries. In that case, it | |
| is recommended that you use either ``index`` or ``index_db``. | |
| Since Python 3.7, the default dict class maintains key order, meaning | |
| this dictionary will reflect the order of records given to it. For | |
| CPython and PyPy, this was already implemented for Python 3.6, so | |
| effectively you can always assume the record order is preserved. | |
| """ | |
| def _default_key_function(rec): | |
| return rec.id | |
| if key_function is None: | |
| key_function = _default_key_function | |
| qdict = {} | |
| for qresult in qresults: | |
| key = key_function(qresult) | |
| if key in qdict: | |
| raise ValueError("Duplicate key %r" % key) | |
| qdict[key] = qresult | |
| return qdict | |
| def index(filename, format=None, key_function=None, **kwargs): | |
| """Indexes a search output file and returns a dictionary-like object. | |
| - filename - string giving name of file to be indexed | |
| - format - Lower case string denoting one of the supported formats. | |
| - key_function - Optional callback function which when given a | |
| QueryResult should return a unique key for the dictionary. | |
| - kwargs - Format-specific keyword arguments. | |
| Index returns a pseudo-dictionary object with QueryResult objects as its | |
| values and a string identifier as its keys. The function is mainly useful | |
| for dealing with large search output files, as it enables access to any | |
| given QueryResult object much faster than using parse or read. | |
| Index works by storing in-memory the start locations of all queries in a | |
| file. When a user requests access to the query, this function will jump | |
| to its start position, parse the whole query, and return it as a | |
| QueryResult object: | |
| >>> from Bio import SearchIO | |
| >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml') | |
| >>> search_idx | |
| SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=None) | |
| >>> sorted(search_idx) | |
| ['gi|156630997:105-1160', 'gi|195230749:301-1383', ..., 'gi|53729353:216-1313'] | |
| >>> search_idx['gi|195230749:301-1383'] | |
| QueryResult(id='gi|195230749:301-1383', 5 hits) | |
| >>> search_idx.close() | |
| If the file is BGZF compressed, this is detected automatically. Ordinary | |
| GZIP files are not supported: | |
| >>> from Bio import SearchIO | |
| >>> search_idx = SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml') | |
| >>> search_idx | |
| SearchIO.index('Blast/wnts.xml.bgz', 'blast-xml', key_function=None) | |
| >>> search_idx['gi|195230749:301-1383'] | |
| QueryResult(id='gi|195230749:301-1383', 5 hits) | |
| >>> search_idx.close() | |
| You can supply a custom callback function to alter the default identifier | |
| string. This function should accept as its input the QueryResult ID string | |
| and return a modified version of it. | |
| >>> from Bio import SearchIO | |
| >>> key_func = lambda id: id.split('|')[1] | |
| >>> search_idx = SearchIO.index('Blast/wnts.xml', 'blast-xml', key_func) | |
| >>> search_idx | |
| SearchIO.index('Blast/wnts.xml', 'blast-xml', key_function=<function <lambda> at ...>) | |
| >>> sorted(search_idx) | |
| ['156630997:105-1160', ..., '371502086:108-1205', '53729353:216-1313'] | |
| >>> search_idx['156630997:105-1160'] | |
| QueryResult(id='gi|156630997:105-1160', 5 hits) | |
| >>> search_idx.close() | |
| Note that the callback function does not change the QueryResult's ID value. | |
| It only changes the key value used to retrieve the associated QueryResult. | |
| """ | |
| if not isinstance(filename, str): | |
| raise TypeError("Need a filename (not a handle)") | |
| from Bio.File import _IndexedSeqFileDict | |
| proxy_class = get_processor(format, _INDEXER_MAP) | |
| repr = f"SearchIO.index({filename!r}, {format!r}, key_function={key_function!r})" | |
| return _IndexedSeqFileDict( | |
| proxy_class(filename, **kwargs), key_function, repr, "QueryResult" | |
| ) | |
| def index_db(index_filename, filenames=None, format=None, key_function=None, **kwargs): | |
| """Indexes several search output files into an SQLite database. | |
| - index_filename - The SQLite filename. | |
| - filenames - List of strings specifying file(s) to be indexed, or when | |
| indexing a single file this can be given as a string. | |
| (optional if reloading an existing index, but must match) | |
| - format - Lower case string denoting one of the supported formats. | |
| (optional if reloading an existing index, but must match) | |
| - key_function - Optional callback function which when given a | |
| QueryResult identifier string should return a unique | |
| key for the dictionary. | |
| - kwargs - Format-specific keyword arguments. | |
| The ``index_db`` function is similar to ``index`` in that it indexes the start | |
| position of all queries from search output files. The main difference is | |
| instead of storing these indices in-memory, they are written to disk as an | |
| SQLite database file. This allows the indices to persist between Python | |
| sessions. This enables access to any queries in the file without any | |
| indexing overhead, provided it has been indexed at least once. | |
| >>> from Bio import SearchIO | |
| >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only! | |
| >>> db_idx = SearchIO.index_db(idx_filename, 'Blast/mirna.xml', 'blast-xml') | |
| >>> sorted(db_idx) | |
| ['33211', '33212', '33213'] | |
| >>> db_idx['33212'] | |
| QueryResult(id='33212', 44 hits) | |
| >>> db_idx.close() | |
| ``index_db`` can also index multiple files and store them in the same | |
| database, making it easier to group multiple search files and access them | |
| from a single interface. | |
| >>> from Bio import SearchIO | |
| >>> idx_filename = ":memory:" # Use a real filename, this is in RAM only! | |
| >>> files = ['Blast/mirna.xml', 'Blast/wnts.xml'] | |
| >>> db_idx = SearchIO.index_db(idx_filename, files, 'blast-xml') | |
| >>> sorted(db_idx) | |
| ['33211', '33212', '33213', 'gi|156630997:105-1160', ..., 'gi|53729353:216-1313'] | |
| >>> db_idx['33212'] | |
| QueryResult(id='33212', 44 hits) | |
| >>> db_idx.close() | |
| One common example where this is helpful is if you had a large set of | |
| query sequences (say ten thousand) which you split into ten query files | |
| of one thousand sequences each in order to run as ten separate BLAST jobs | |
| on a cluster. You could use ``index_db`` to index the ten BLAST output | |
| files together for seamless access to all the results as one dictionary. | |
| Note that ':memory:' rather than an index filename tells SQLite to hold | |
| the index database in memory. This is useful for quick tests, but using | |
| the Bio.SearchIO.index(...) function instead would use less memory. | |
| BGZF compressed files are supported, and detected automatically. Ordinary | |
| GZIP compressed files are not supported. | |
| See also Bio.SearchIO.index(), Bio.SearchIO.to_dict(), and the Python module | |
| glob which is useful for building lists of files. | |
| """ | |
| # cast filenames to list if it's a string | |
| # (can we check if it's a string or a generator?) | |
| if isinstance(filenames, str): | |
| filenames = [filenames] | |
| from Bio.File import _SQLiteManySeqFilesDict | |
| repr = f"SearchIO.index_db({index_filename!r}, filenames={filenames!r}, {format!r}, key_function={key_function!r})" | |
| def proxy_factory(format, filename=None): | |
| """Given a filename returns proxy object, else boolean if format OK.""" | |
| if filename: | |
| return get_processor(format, _INDEXER_MAP)(filename, **kwargs) | |
| else: | |
| return format in _INDEXER_MAP | |
| return _SQLiteManySeqFilesDict( | |
| index_filename, filenames, proxy_factory, format, key_function, repr | |
| ) | |
| def write(qresults, handle, format=None, **kwargs): | |
| """Write QueryResult objects to a file in the given format. | |
| - qresults - An iterator returning QueryResult objects or a single | |
| QueryResult object. | |
| - handle - Handle to the file, or the filename as a string. | |
| - format - Lower case string denoting one of the supported formats. | |
| - kwargs - Format-specific keyword arguments. | |
| The ``write`` function writes QueryResult object(s) into the given output | |
| handle / filename. You can supply it with a single QueryResult object or an | |
| iterable returning one or more QueryResult objects. In both cases, the | |
| function will return a tuple of four values: the number of QueryResult, Hit, | |
| HSP, and HSPFragment objects it writes to the output file:: | |
| from Bio import SearchIO | |
| qresults = SearchIO.parse('Blast/mirna.xml', 'blast-xml') | |
| SearchIO.write(qresults, 'results.tab', 'blast-tab') | |
| <stdout> (3, 239, 277, 277) | |
| The output of different formats may be adjusted using the format-specific | |
| keyword arguments. Here is an example that writes BLAT PSL output file with | |
| a header:: | |
| from Bio import SearchIO | |
| qresults = SearchIO.parse('Blat/psl_34_001.psl', 'blat-psl') | |
| SearchIO.write(qresults, 'results.tab', 'blat-psl', header=True) | |
| <stdout> (2, 13, 22, 26) | |
| """ | |
| # turn qresults into an iterator if it's a single QueryResult object | |
| if isinstance(qresults, QueryResult): | |
| qresults = iter([qresults]) | |
| else: | |
| qresults = iter(qresults) | |
| # get the writer object and do error checking | |
| writer_class = get_processor(format, _WRITER_MAP) | |
| # write to the handle | |
| with as_handle(handle, "w") as target_file: | |
| writer = writer_class(target_file, **kwargs) | |
| # count how many qresults, hits, and hsps | |
| qresult_count, hit_count, hsp_count, frag_count = writer.write_file(qresults) | |
| return qresult_count, hit_count, hsp_count, frag_count | |
| def convert(in_file, in_format, out_file, out_format, in_kwargs=None, out_kwargs=None): | |
| """Convert between two search output formats, return number of records. | |
| - in_file - Handle to the input file, or the filename as string. | |
| - in_format - Lower case string denoting the format of the input file. | |
| - out_file - Handle to the output file, or the filename as string. | |
| - out_format - Lower case string denoting the format of the output file. | |
| - in_kwargs - Dictionary of keyword arguments for the input function. | |
| - out_kwargs - Dictionary of keyword arguments for the output function. | |
| The convert function is a shortcut function for ``parse`` and ``write``. It has | |
| the same return type as ``write``. Format-specific arguments may be passed to | |
| the convert function, but only as dictionaries. | |
| Here is an example of using ``convert`` to convert from a BLAST+ XML file | |
| into a tabular file with comments:: | |
| from Bio import SearchIO | |
| in_file = 'Blast/mirna.xml' | |
| in_fmt = 'blast-xml' | |
| out_file = 'results.tab' | |
| out_fmt = 'blast-tab' | |
| out_kwarg = {'comments': True} | |
| SearchIO.convert(in_file, in_fmt, out_file, out_fmt, out_kwargs=out_kwarg) | |
| <stdout> (3, 239, 277, 277) | |
| Given that different search output file provide different statistics and | |
| different level of details, the convert function is limited only to | |
| converting formats that have the same statistics and for conversion to | |
| formats with the same level of detail, or less. | |
| For example, converting from a BLAST+ XML output to a HMMER table file | |
| is not possible, as these are two search programs with different kinds of | |
| statistics. In theory, you may provide the necessary values required by the | |
| HMMER table file (e.g. conditional e-values, envelope coordinates, etc). | |
| However, these values are likely to hold little meaning as they are not true | |
| HMMER-computed values. | |
| Another example is converting from BLAST+ XML to BLAST+ tabular file. This | |
| is possible, as BLAST+ XML provide all the values necessary to create a | |
| BLAST+ tabular file. However, the reverse conversion may not be possible. | |
| There are more details covered in the XML file that are not found in a | |
| tabular file (e.g. the lambda and kappa values) | |
| """ | |
| if in_kwargs is None: | |
| in_kwargs = {} | |
| if out_kwargs is None: | |
| out_kwargs = {} | |
| qresults = parse(in_file, in_format, **in_kwargs) | |
| return write(qresults, out_file, out_format, **out_kwargs) | |
| # if not used as a module, run the doctest | |
| if __name__ == "__main__": | |
| from Bio._utils import run_doctest | |
| run_doctest() | |