| # Copyright 2008-2018 by Peter Cock. All rights reserved. | |
| # | |
| # This file is part of the Biopython distribution and governed by your | |
| # choice of the "Biopython License Agreement" or the "BSD 3-Clause License". | |
| # Please see the LICENSE file that should have been included as part of this | |
| # package. | |
| """Multiple sequence alignment input/output as alignment objects. | |
| The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in | |
| fact the two are connected internally. Both modules use the same set of file | |
| format names (lower case strings). From the user's perspective, you can read | |
| in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you | |
| can read in the sequences within these alignments using Bio.SeqIO. | |
| Bio.AlignIO is also documented at http://biopython.org/wiki/AlignIO and by | |
| a whole chapter in our tutorial: | |
| * `HTML Tutorial`_ | |
| * `PDF Tutorial`_ | |
| .. _`HTML Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.html | |
| .. _`PDF Tutorial`: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf | |
| Input | |
| ----- | |
| For the typical special case when your file or handle contains one and only | |
| one alignment, use the function Bio.AlignIO.read(). This takes an input file | |
| handle (or in recent versions of Biopython a filename as a string), format | |
| string and optional number of sequences per alignment. It will return a single | |
| MultipleSeqAlignment object (or raise an exception if there isn't just one | |
| alignment): | |
| from Bio import AlignIO | |
| align = AlignIO.read("Phylip/interlaced.phy", "phylip") | |
| print(align) | |
| Alignment with 3 rows and 384 columns | |
| -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI | |
| MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU | |
| ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN | |
| For the general case, when the handle could contain any number of alignments, | |
| use the function Bio.AlignIO.parse(...) which takes the same arguments, but | |
| returns an iterator giving MultipleSeqAlignment objects (typically used in a | |
| for loop). If you want random access to the alignments by number, turn this | |
| into a list: | |
| from Bio import AlignIO | |
| alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss")) | |
| print(alignments[2]) | |
| Alignment with 2 rows and 120 columns | |
| -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec | |
| LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver | |
| Most alignment file formats can be concatenated so that a single file holds | |
| any number of multiple sequence alignments. One common example | |
| is the output of the tool seqboot in the PHYLIP suite. Sometimes there | |
| can be a file header and footer, as seen in the EMBOSS alignment output. | |
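The optional number of sequences per alignment matters for formats such as FASTA, which carry no alignment boundaries of their own. As a rough sketch of the batching idea (plain Python rather than the Biopython implementation; `batch_records` and the sample record names are hypothetical), a flat stream of records can be cut into fixed-size groups, with a leftover partial group treated as an error:

```python
from typing import Iterable, Iterator, List


def batch_records(records: Iterable[str], seq_count: int) -> Iterator[List[str]]:
    """Yield consecutive groups of exactly seq_count records."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == seq_count:
            yield batch
            batch = []
    if batch:
        # A trailing partial group means seq_count does not divide the input.
        raise ValueError("Check seq_count argument, not enough sequences?")


# Six records with three sequences per alignment give two groups.
groups = list(batch_records(["a1", "b1", "c1", "a2", "b2", "c2"], 3))
```

Each yielded group stands in for the records that would make up one alignment.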
| Output | |
| ------ | |
| Use the function Bio.AlignIO.write(...), which takes a complete set of | |
| Alignment objects (either as a list, or an iterator), an output file handle | |
| (or filename in recent versions of Biopython) and of course the file format:: | |
| from Bio import AlignIO | |
| alignments = ... | |
| count = AlignIO.write(alignments, "example.faa", "fasta") | |
| If using a handle make sure to close it to flush the data to the disk:: | |
| from Bio import AlignIO | |
| alignments = ... | |
| with open("example.faa", "w") as handle: | |
| count = AlignIO.write(alignments, handle, "fasta") | |
| In general, you are expected to call this function once (with all your | |
| alignments) and then close the file handle. However, for file formats | |
| like PHYLIP where multiple alignments are stored sequentially (with no file | |
| header and footer), then multiple calls to the write function should work as | |
| expected when using handles. | |
| If you are using a filename, repeated calls to the write function will | |
| overwrite the existing file each time. | |
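The difference between handles and filenames comes down to the file being reopened in write mode on every call. A minimal sketch with plain files, assuming a hypothetical `write_text` helper that stands in for a write function accepting either a filename or an open handle:

```python
import os
import tempfile


def write_text(target, text):
    """Write text to an open handle, or to a filename opened in "w" mode."""
    if isinstance(target, str):
        # Opening by name in "w" mode truncates any existing file.
        with open(target, "w") as handle:
            handle.write(text)
    else:
        target.write(text)


path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Two calls with a filename: only the second block survives.
write_text(path, "first\n")
write_text(path, "second\n")
with open(path) as handle:
    by_name = handle.read()

# Two calls with one open handle: both blocks are kept, in order.
with open(path, "w") as handle:
    write_text(handle, "first\n")
    write_text(handle, "second\n")
with open(path) as handle:
    by_handle = handle.read()
```

With a filename only the last block is left in the file, while the shared handle accumulates both.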
| Conversion | |
| ---------- | |
| The Bio.AlignIO.convert(...) function allows an easy interface for simple | |
| alignment file format conversions. Additionally, it may use file format | |
| specific optimisations so this should be the fastest way too. | |
| In general however, you can combine the Bio.AlignIO.parse(...) function with | |
| the Bio.AlignIO.write(...) function for alignment file conversion. Using | |
| generator expressions provides a memory efficient way to perform filtering or | |
| other extra operations as part of the process. | |
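The parse, filter and write pattern can be sketched without any real alignment files. In the illustration below, `parse_blocks` and `write_blocks` are hypothetical stand-ins (not part of Bio.AlignIO) for a lazy parser and a counting writer, and the blank-line separated text is made-up data. The generator expression passes one block at a time from reader to writer, so the full input is never held in memory:

```python
from io import StringIO


def parse_blocks(handle):
    """Lazily yield blank-line separated blocks as lists of lines."""
    block = []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def write_blocks(blocks, handle):
    """Write blocks one at a time and return how many were written."""
    count = 0
    for block in blocks:
        handle.write("\n".join(block) + "\n\n")
        count += 1
    return count


source = StringIO("ACGT\nAC-T\n\nGG\n\nTTAA\nTT-A\nTTCA\n")
target = StringIO()

# Keep only blocks with at least two rows; nothing is collected into a list.
wanted = (block for block in parse_blocks(source) if len(block) >= 2)
count = write_blocks(wanted, target)
```

With Bio.AlignIO the same shape would apply, with the parse and write functions described above in place of the stand-ins.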
| File Formats | |
| ------------ | |
| When specifying the file format, use lowercase strings. The same format | |
| names are also used in Bio.SeqIO and include the following: | |
| - clustal - Output from Clustal W or X. | |
| - emboss - EMBOSS tools' "pairs" and "simple" alignment formats. | |
| - fasta - The generic sequence file format where each record starts with | |
| an identifier line starting with a ">" character, followed by | |
| lines of sequence. | |
| - fasta-m10 - For the pairwise alignments output by Bill Pearson's FASTA | |
| tools when used with the -m 10 command line option for machine | |
| readable output. | |
| - ig - The IntelliGenetics file format, apparently the same as the | |
| MASE alignment format. | |
| - msf - The GCG MSF alignment format, originally from PileUp tool. | |
| - nexus - Output from NEXUS, see also the module Bio.Nexus which can also | |
| read any phylogenetic trees in these files. | |
| - phylip - Interlaced PHYLIP, as used by the PHYLIP tools. | |
| - phylip-sequential - Sequential PHYLIP. | |
| - phylip-relaxed - PHYLIP like format allowing longer names. | |
| - stockholm - A richly annotated alignment file format used by PFAM. | |
| - mauve - Output from progressiveMauve/Mauve. | |
| - maf - The Multiple Alignment Format (MAF), as described by UCSC. | |
| Note that while Bio.AlignIO can read all the above file formats, it cannot | |
| write to all of them. | |
| You can also use any file format supported by Bio.SeqIO, such as "fasta" or | |
| "ig" (which are listed above), PROVIDED the sequences in your file are all the | |
| same length. | |
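The same-length requirement can be checked up front before treating a set of sequences as an alignment. A small sketch on plain strings, where `all_same_length` is a hypothetical helper and not a Biopython function:

```python
def all_same_length(sequences):
    """Return True if every sequence string has the same length."""
    lengths = {len(seq) for seq in sequences}
    return len(lengths) <= 1


aligned = ["ACG-T", "AC-GT", "ACGGT"]
unaligned = ["ACGT", "ACG", "ACGGT"]
ok = all_same_length(aligned)
bad = all_same_length(unaligned)
```

Here the gapped sequences pass the check, while the ungapped ones of differing lengths do not.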
| """ | |
| # TODO | |
| # - define policy on reading aligned sequences with gaps in | |
| # (e.g. - and . characters) | |
| # | |
| # - Can we build the to_alignment(...) functionality | |
| # into the generic Alignment class instead? | |
| # | |
| # - How best to handle unique/non unique record.id when writing. | |
| # For most file formats reading such files is fine; The stockholm | |
| # parser would fail. | |
| # | |
| # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) | |
| # http://www.bioperl.org/wiki/MSF_multiple_alignment_format | |
| from Bio.Align import MultipleSeqAlignment | |
| from Bio.File import as_handle | |
| from . import ClustalIO | |
| from . import EmbossIO | |
| from . import FastaIO | |
| from . import MafIO | |
| from . import MauveIO | |
| from . import MsfIO | |
| from . import NexusIO | |
| from . import PhylipIO | |
| from . import StockholmIO | |
| # Convention for format names is "mainname-subtype" in lower case. | |
| # Please use the same names as BioPerl and EMBOSS where possible. | |
| _FormatToIterator = { # "fasta" is done via Bio.SeqIO | |
| "clustal": ClustalIO.ClustalIterator, | |
| "emboss": EmbossIO.EmbossIterator, | |
| "fasta-m10": FastaIO.FastaM10Iterator, | |
| "maf": MafIO.MafIterator, | |
| "mauve": MauveIO.MauveIterator, | |
| "msf": MsfIO.MsfIterator, | |
| "nexus": NexusIO.NexusIterator, | |
| "phylip": PhylipIO.PhylipIterator, | |
| "phylip-sequential": PhylipIO.SequentialPhylipIterator, | |
| "phylip-relaxed": PhylipIO.RelaxedPhylipIterator, | |
| "stockholm": StockholmIO.StockholmIterator, | |
| } | |
| _FormatToWriter = { # "fasta" is done via Bio.SeqIO | |
| "clustal": ClustalIO.ClustalWriter, | |
| "maf": MafIO.MafWriter, | |
| "mauve": MauveIO.MauveWriter, | |
| "nexus": NexusIO.NexusWriter, | |
| "phylip": PhylipIO.PhylipWriter, | |
| "phylip-sequential": PhylipIO.SequentialPhylipWriter, | |
| "phylip-relaxed": PhylipIO.RelaxedPhylipWriter, | |
| "stockholm": StockholmIO.StockholmWriter, | |
| } | |
| def write(alignments, handle, format): | |
| """Write complete set of alignments to a file. | |
| Arguments: | |
| - alignments - A list (or iterator) of MultipleSeqAlignment objects, | |
| or a single alignment object. | |
| - handle - File handle object to write to, or filename as string | |
| (note older versions of Biopython only took a handle). | |
| - format - lower case string describing the file format to write. | |
| You should close the handle after calling this function. | |
| Returns the number of alignments written (as an integer). | |
| """ | |
| from Bio import SeqIO | |
| # Try and give helpful error messages: | |
| if not isinstance(format, str): | |
| raise TypeError("Need a string for the file format (lower case)") | |
| if not format: | |
| raise ValueError("Format required (lower case string)") | |
| if format != format.lower(): | |
| raise ValueError(f"Format string '{format}' should be lower case") | |
| if isinstance(alignments, MultipleSeqAlignment): | |
| # This raised an exception in older versions of Biopython | |
| alignments = [alignments] | |
| with as_handle(handle, "w") as fp: | |
| # Map the file format to a writer class | |
| if format in _FormatToWriter: | |
| writer_class = _FormatToWriter[format] | |
| count = writer_class(fp).write_file(alignments) | |
| elif format in SeqIO._FormatToWriter: | |
| # Exploit the existing SeqIO writer to do the dirty work! | |
| # TODO - Can we make one call to SeqIO.write() and count the alignments? | |
| count = 0 | |
| for alignment in alignments: | |
| if not isinstance(alignment, MultipleSeqAlignment): | |
| raise TypeError( | |
| "Expect a list or iterator of MultipleSeqAlignment " | |
| "objects, got: %r" % alignment | |
| ) | |
| SeqIO.write(alignment, fp, format) | |
| count += 1 | |
| elif format in _FormatToIterator or format in SeqIO._FormatToIterator: | |
| raise ValueError(f"Reading format '{format}' is supported, but not writing") | |
| else: | |
| raise ValueError(f"Unknown format '{format}'") | |
| if not isinstance(count, int): | |
| raise RuntimeError( | |
| "Internal error - the underlying %s " | |
| "writer should have returned the alignment count, not %r" % (format, count) | |
| ) | |
| return count | |
| # This is a generator function! | |
| def _SeqIO_to_alignment_iterator(handle, format, seq_count=None): | |
| """Use Bio.SeqIO to create a MultipleSeqAlignment iterator (PRIVATE). | |
| Arguments: | |
| - handle - handle to the file. | |
| - format - string describing the file format. | |
| - seq_count - Optional integer, number of sequences expected in each | |
| alignment. Recommended for fasta format files. | |
| If seq_count is omitted (default) then all the sequences in the file are | |
| combined into a single MultipleSeqAlignment. | |
| """ | |
| from Bio import SeqIO | |
| if format not in SeqIO._FormatToIterator: | |
| raise ValueError(f"Unknown format '{format}'") | |
| if seq_count: | |
| # Use the count to split the records into batches. | |
| seq_record_iterator = SeqIO.parse(handle, format) | |
| records = [] | |
| for record in seq_record_iterator: | |
| records.append(record) | |
| if len(records) == seq_count: | |
| yield MultipleSeqAlignment(records) | |
| records = [] | |
| if records: | |
| raise ValueError("Check seq_count argument, not enough sequences?") | |
| else: | |
| # Must assume that there is a single alignment using all | |
| # the SeqRecord objects: | |
| records = list(SeqIO.parse(handle, format)) | |
| if records: | |
| yield MultipleSeqAlignment(records) | |
| def parse(handle, format, seq_count=None): | |
| """Iterate over an alignment file as MultipleSeqAlignment objects. | |
| Arguments: | |
| - handle - handle to the file, or the filename as a string | |
| (note older versions of Biopython only took a handle). | |
| - format - string describing the file format. | |
| - seq_count - Optional integer, number of sequences expected in each | |
| alignment. Recommended for fasta format files. | |
| If you have the file name in a string 'filename', use: | |
| >>> from Bio import AlignIO | |
| >>> filename = "Emboss/needle.txt" | |
| >>> format = "emboss" | |
| >>> for alignment in AlignIO.parse(filename, format): | |
| ... print("Alignment of length %i" % alignment.get_alignment_length()) | |
| Alignment of length 124 | |
| Alignment of length 119 | |
| Alignment of length 120 | |
| Alignment of length 118 | |
| Alignment of length 125 | |
| If you have a string 'data' containing the file contents, use:: | |
| from Bio import AlignIO | |
| from io import StringIO | |
| my_iterator = AlignIO.parse(StringIO(data), format) | |
| Use the Bio.AlignIO.read() function when you expect a single record only. | |
| """ | |
| from Bio import SeqIO | |
| # Try and give helpful error messages: | |
| if not isinstance(format, str): | |
| raise TypeError("Need a string for the file format (lower case)") | |
| if not format: | |
| raise ValueError("Format required (lower case string)") | |
| if format != format.lower(): | |
| raise ValueError(f"Format string '{format}' should be lower case") | |
| if seq_count is not None and not isinstance(seq_count, int): | |
| raise TypeError("Need integer for seq_count (sequences per alignment)") | |
| with as_handle(handle) as fp: | |
| # Map the file format to a sequence iterator: | |
| if format in _FormatToIterator: | |
| iterator_generator = _FormatToIterator[format] | |
| i = iterator_generator(fp, seq_count) | |
| elif format in SeqIO._FormatToIterator: | |
| # Exploit the existing SeqIO parser to do the dirty work! | |
| i = _SeqIO_to_alignment_iterator(fp, format, seq_count=seq_count) | |
| else: | |
| raise ValueError(f"Unknown format '{format}'") | |
| yield from i | |
| def read(handle, format, seq_count=None): | |
| """Turn an alignment file into a single MultipleSeqAlignment object. | |
| Arguments: | |
| - handle - handle to the file, or the filename as a string | |
| (note older versions of Biopython only took a handle). | |
| - format - string describing the file format. | |
| - seq_count - Optional integer, number of sequences expected in each | |
| alignment. Recommended for fasta format files. | |
| If the handle contains no alignments, or more than one alignment, | |
| an exception is raised. For example, using a Clustal W file | |
| containing one alignment: | |
| >>> from Bio import AlignIO | |
| >>> filename = "Clustalw/protein.aln" | |
| >>> format = "clustal" | |
| >>> alignment = AlignIO.read(filename, format) | |
| >>> print("Alignment of length %i" % alignment.get_alignment_length()) | |
| Alignment of length 411 | |
| If however you want the first alignment from a file containing | |
| multiple alignments this function would raise an exception. | |
| >>> from Bio import AlignIO | |
| >>> filename = "Emboss/needle.txt" | |
| >>> format = "emboss" | |
| >>> alignment = AlignIO.read(filename, format) | |
| Traceback (most recent call last): | |
| ... | |
| ValueError: More than one record found in handle | |
| Instead use: | |
| >>> from Bio import AlignIO | |
| >>> filename = "Emboss/needle.txt" | |
| >>> format = "emboss" | |
| >>> alignment = next(AlignIO.parse(filename, format)) | |
| >>> print("First alignment has length %i" % alignment.get_alignment_length()) | |
| First alignment has length 124 | |
| You must use the Bio.AlignIO.parse() function if you want to read multiple | |
| records from the handle. | |
| """ | |
| iterator = parse(handle, format, seq_count) | |
| try: | |
| alignment = next(iterator) | |
| except StopIteration: | |
| raise ValueError("No records found in handle") from None | |
| try: | |
| next(iterator) | |
| raise ValueError("More than one record found in handle") | |
| except StopIteration: | |
| pass | |
| if seq_count: | |
| if len(alignment) != seq_count: | |
| raise RuntimeError( | |
| "Number of sequences found in alignment does not match seq_count: %s." | |
| % seq_count | |
| ) | |
| return alignment | |
| def convert(in_file, in_format, out_file, out_format, molecule_type=None): | |
| """Convert between two alignment files, returns number of alignments. | |
| Arguments: | |
| - in_file - an input handle or filename | |
| - in_format - input file format, lower case string | |
| - out_file - an output handle or filename | |
| - out_format - output file format, lower case string | |
| - molecule_type - optional molecule type to apply, string containing | |
| "DNA", "RNA" or "protein". | |
| **NOTE** - If you provide an output filename, it will be opened which will | |
| overwrite any existing file without warning. This may happen even if the | |
| conversion is aborted (e.g. an invalid out_format name is given). | |
| Some output formats require the molecule type be specified where this | |
| cannot be determined by the parser. For example, converting from FASTA, | |
| Clustal, or PHYLIP format to NEXUS: | |
| >>> from io import StringIO | |
| >>> from Bio import AlignIO | |
| >>> handle = StringIO() | |
| >>> AlignIO.convert("Phylip/horses.phy", "phylip", handle, "nexus", "DNA") | |
| 1 | |
| >>> print(handle.getvalue()) | |
| #NEXUS | |
| begin data; | |
| dimensions ntax=10 nchar=40; | |
| format datatype=dna missing=? gap=-; | |
| matrix | |
| Mesohippus AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | |
| Hypohippus AAACCCCCCCAAAAAAAAACAAAAAAAAAAAAAAAAAAAA | |
| Archaeohip CAAAAAAAAAAAAAAAACACAAAAAAAAAAAAAAAAAAAA | |
| Parahippus CAAACAACAACAAAAAAAACAAAAAAAAAAAAAAAAAAAA | |
| Merychippu CCAACCACCACCCCACACCCAAAAAAAAAAAAAAAAAAAA | |
| 'M. secundu' CCAACCACCACCCACACCCCAAAAAAAAAAAAAAAAAAAA | |
| Nannipus CCAACCACAACCCCACACCCAAAAAAAAAAAAAAAAAAAA | |
| Neohippari CCAACCCCCCCCCCACACCCAAAAAAAAAAAAAAAAAAAA | |
| Calippus CCAACCACAACCCACACCCCAAAAAAAAAAAAAAAAAAAA | |
| Pliohippus CCCACCCCCCCCCACACCCCAAAAAAAAAAAAAAAAAAAA | |
| ; | |
| end; | |
| <BLANKLINE> | |
| """ | |
| if molecule_type: | |
| if not isinstance(molecule_type, str): | |
| raise TypeError(f"Molecule type should be a string, not {molecule_type!r}") | |
| elif ( | |
| "DNA" in molecule_type | |
| or "RNA" in molecule_type | |
| or "protein" in molecule_type | |
| ): | |
| pass | |
| else: | |
| raise ValueError(f"Unexpected molecule type, {molecule_type!r}") | |
| # TODO - Add optimised versions of important conversions | |
| # For now just off load the work to SeqIO parse/write | |
| # Don't open the output file until we've checked the input is OK: | |
| alignments = parse(in_file, in_format, None) | |
| if molecule_type: | |
| # Edit the records on the fly to set molecule type | |
| def over_ride(alignment): | |
| """Over-ride molecule in-place.""" | |
| for record in alignment: | |
| record.annotations["molecule_type"] = molecule_type | |
| return alignment | |
| alignments = (over_ride(_) for _ in alignments) | |
| return write(alignments, out_file, out_format) | |
| if __name__ == "__main__": | |
| from Bio._utils import run_doctest | |
| run_doctest() | |