| .. sectionauthor:: Pierre Gerard-Marchant <pierregmcode@gmail.com> | |
| ********************************************* | |
| Importing data with :func:`~numpy.genfromtxt` | |
| ********************************************* | |
| NumPy provides several functions to create arrays from tabular data. | |
| We focus here on the :func:`~numpy.genfromtxt` function. | |
| In a nutshell, :func:`~numpy.genfromtxt` runs two main loops. The first | |
| loop converts each line of the file into a sequence of strings. The second | |
| loop converts each string to the appropriate data type. This mechanism is | |
| slower than a single loop, but gives more flexibility. In particular, | |
| :func:`~numpy.genfromtxt` is able to take missing data into account, when | |
| other faster and simpler functions like :func:`~numpy.loadtxt` cannot. | |
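As a minimal sketch of that difference, assuming a small comma-separated sample with one empty field (the sample text and variable names are ours, not part of the original guide), we can compare the two functions:

```python
from io import StringIO

import numpy as np

# Hypothetical sample: the second row has an empty middle field.
text = "1,2,3\n4,,6"

# genfromtxt treats the empty field as missing and fills it with nan.
arr = np.genfromtxt(StringIO(text), delimiter=",")

# loadtxt has no notion of missing data and rejects the empty field.
try:
    np.loadtxt(StringIO(text), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

print(arr)
print("loadtxt failed:", loadtxt_failed)
```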
| .. note:: | |
| When giving examples, we will use the following conventions:: | |
| >>> import numpy as np | |
| >>> from io import StringIO | |
| Defining the input | |
| ================== | |
| The only mandatory argument of :func:`~numpy.genfromtxt` is the source of | |
| the data. It can be a string corresponding to the name of a local or | |
| remote file, or a file-like object with a :meth:`read` method (such as an | |
| actual file or a :class:`io.StringIO` object). If the argument is | |
| the URL of a remote file, the file is automatically downloaded to the | |
| current directory. | |
| The input file can be a text file or an archive. Currently, the function | |
| recognizes :class:`gzip` and :class:`bz2` (`bzip2`) archives. The type of | |
| the archive is determined by examining the extension of the file: if the | |
| filename ends with ``'.gz'``, a :class:`gzip` archive is expected; if it | |
| ends with ``'.bz2'``, a :class:`bz2` archive is assumed. | |
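The following sketch illustrates reading a gzip-compressed file by name. The file name ``example.csv.gz`` and its contents are made up for the illustration, and we assume the archive holds plain comma-separated text:

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name; the '.gz' extension is what triggers decompression.
    path = os.path.join(tmpdir, "example.csv.gz")
    with gzip.open(path, "wt") as fh:
        fh.write("1,2,3\n4,5,6\n")

    # The file is passed by name; no explicit gzip handling is needed here.
    arr = np.genfromtxt(path, delimiter=",")

print(arr)
```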
| Splitting the lines into columns | |
| ================================ | |
| The :keyword:`delimiter` argument | |
| --------------------------------- | |
| Once the file is defined and open for reading, :func:`~numpy.genfromtxt` | |
| splits each non-empty line into a sequence of strings. Empty or commented | |
| lines are just skipped. The :keyword:`delimiter` keyword is used to define | |
| how the splitting should take place. | |
| Quite often, a single character marks the separation between columns. For | |
| example, comma-separated files (CSV) use a comma (``,``) or a semicolon | |
| (``;``) as delimiter:: | |
| >>> data = "1, 2, 3\n4, 5, 6" | |
| >>> np.genfromtxt(StringIO(data), delimiter=",") | |
| array([[ 1., 2., 3.], | |
| [ 4., 5., 6.]]) | |
| Another common separator is ``"\t"``, the tabulation character. However, | |
| we are not limited to a single character: any string will do. By default, | |
| :func:`~numpy.genfromtxt` assumes ``delimiter=None``, meaning that the line | |
| is split along white spaces (including tabs) and that consecutive white | |
| spaces are considered as a single white space. | |
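A short sketch of both cases, using sample strings of our own: an explicit tab delimiter, and the default ``delimiter=None`` applied to irregular runs of spaces and tabs:

```python
from io import StringIO

import numpy as np

# Tab-separated input with an explicit delimiter.
tabbed = "1\t2\t3\n4\t5\t6"
from_tabs = np.genfromtxt(StringIO(tabbed), delimiter="\t")

# Default delimiter: any run of spaces or tabs counts as one separator.
ragged = "1   2\t3\n4 5      6"
from_whitespace = np.genfromtxt(StringIO(ragged))

print(from_tabs)
print(from_whitespace)
```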
| Alternatively, we may be dealing with a fixed-width file, where columns are | |
| defined as a given number of characters. In that case, we need to set | |
| :keyword:`delimiter` to a single integer (if all the columns have the same | |
| size) or to a sequence of integers (if columns can have different sizes):: | |
| >>> data = "  1  2  3\n  4  5 67\n890123  4" | |
| >>> np.genfromtxt(StringIO(data), delimiter=3) | |
| array([[ 1., 2., 3.], | |
| [ 4., 5., 67.], | |
| [ 890., 123., 4.]]) | |
| >>> data = "123456789\n   4  7 9\n   4567 9" | |
| >>> np.genfromtxt(StringIO(data), delimiter=(4, 3, 2)) | |
| array([[ 1234., 567., 89.], | |
| [ 4., 7., 9.], | |
| [ 4., 567., 9.]]) | |
| The :keyword:`autostrip` argument | |
| --------------------------------- | |
| By default, when a line is decomposed into a series of strings, the | |
| individual entries are not stripped of leading or trailing white spaces. | |
| This behavior can be overridden by setting the optional argument | |
| :keyword:`autostrip` to a value of ``True``:: | |
| >>> data = "1, abc , 2\n 3, xxx, 4" | |
| >>> # Without autostrip | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5") | |
| array([['1', ' abc ', ' 2'], | |
| ['3', ' xxx', ' 4']], | |
| dtype='|S5') | |
| >>> # With autostrip | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|S5", autostrip=True) | |
| array([['1', 'abc', '2'], | |
| ['3', 'xxx', '4']], | |
| dtype='|S5') | |
| The :keyword:`comments` argument | |
| -------------------------------- | |
| The optional argument :keyword:`comments` is used to define a character | |
| string that marks the beginning of a comment. By default, | |
| :func:`~numpy.genfromtxt` assumes ``comments='#'``. The comment marker may | |
| occur anywhere on the line. Any character present after the comment | |
| marker(s) is simply ignored:: | |
| >>> data = """# | |
| ... # Skip me ! | |
| ... # Skip me too ! | |
| ... 1, 2 | |
| ... 3, 4 | |
| ... 5, 6 #This is the third line of the data | |
| ... 7, 8 | |
| ... # And here comes the last line | |
| ... 9, 0 | |
| ... """ | |
| >>> np.genfromtxt(StringIO(data), comments="#", delimiter=",") | |
| array([[ 1., 2.], | |
| [ 3., 4.], | |
| [ 5., 6.], | |
| [ 7., 8.], | |
| [ 9., 0.]]) | |
| .. note:: | |
| There is one notable exception to this behavior: if the optional argument | |
| ``names=True``, the first commented line will be examined for names. | |
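The marker does not have to be ``#``. As a sketch, assuming a file that uses ``%`` for comments (the sample text is ours):

```python
from io import StringIO

import numpy as np

# Hypothetical input using '%' as the comment marker.
text = "% produced by some instrument\n1, 2\n3, 4 % trailing remark\n"

# Full-line comments are skipped; trailing comments are cut off the line.
arr = np.genfromtxt(StringIO(text), comments="%", delimiter=",")
print(arr)
```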
| Skipping lines and choosing columns | |
| =================================== | |
| The :keyword:`skip_header` and :keyword:`skip_footer` arguments | |
| --------------------------------------------------------------- | |
| The presence of a header in the file can hinder data processing. In that | |
| case, we need to use the :keyword:`skip_header` optional argument. The | |
| value of this argument must be an integer which corresponds to the number | |
| of lines to skip at the beginning of the file, before any other action is | |
| performed. Similarly, we can skip the last ``n`` lines of the file by | |
| using the :keyword:`skip_footer` argument and giving it a value of ``n``:: | |
| >>> data = "\n".join(str(i) for i in range(10)) | |
| >>> np.genfromtxt(StringIO(data)) | |
| array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]) | |
| >>> np.genfromtxt(StringIO(data), | |
| ... skip_header=3, skip_footer=5) | |
| array([ 3., 4.]) | |
| By default, ``skip_header=0`` and ``skip_footer=0``, meaning that no lines | |
| are skipped. | |
| The :keyword:`usecols` argument | |
| ------------------------------- | |
| In some cases, we are not interested in all the columns of the data but | |
| only a few of them. We can select which columns to import with the | |
| :keyword:`usecols` argument. This argument accepts a single integer or a | |
| sequence of integers corresponding to the indices of the columns to import. | |
| Remember that by convention, the first column has an index of 0. Negative | |
| integers behave the same as regular Python negative indexes. | |
| For example, if we want to import only the first and the last columns, we | |
| can use ``usecols=(0, -1)``:: | |
| >>> data = "1 2 3\n4 5 6" | |
| >>> np.genfromtxt(StringIO(data), usecols=(0, -1)) | |
| array([[ 1., 3.], | |
| [ 4., 6.]]) | |
| If the columns have names, we can also select which columns to import by | |
| giving their name to the :keyword:`usecols` argument, either as a sequence | |
| of strings or a comma-separated string:: | |
| >>> data = "1 2 3\n4 5 6" | |
| >>> np.genfromtxt(StringIO(data), | |
| ... names="a, b, c", usecols=("a", "c")) | |
| array([(1.0, 3.0), (4.0, 6.0)], | |
| dtype=[('a', '<f8'), ('c', '<f8')]) | |
| >>> np.genfromtxt(StringIO(data), | |
| ... names="a, b, c", usecols=("a, c")) | |
| array([(1.0, 3.0), (4.0, 6.0)], | |
| dtype=[('a', '<f8'), ('c', '<f8')]) | |
| Choosing the data type | |
| ====================== | |
| The main way to control how the sequences of strings we have read from the | |
| file are converted to other types is to set the :keyword:`dtype` argument. | |
| Acceptable values for this argument are: | |
| * a single type, such as ``dtype=float``. | |
| The output will be 2D with the given dtype, unless a name has been | |
| associated with each column with the use of the :keyword:`names` argument | |
| (see below). Note that ``dtype=float`` is the default for | |
| :func:`~numpy.genfromtxt`. | |
| * a sequence of types, such as ``dtype=(int, float, float)``. | |
| * a comma-separated string, such as ``dtype="i4,f8,|S3"``. | |
| * a dictionary with two keys ``'names'`` and ``'formats'``. | |
| * a sequence of tuples ``(name, type)``, such as | |
| ``dtype=[('A', int), ('B', float)]``. | |
| * an existing :class:`numpy.dtype` object. | |
| * the special value ``None``. | |
| In that case, the type of the columns will be determined from the data | |
| itself (see below). | |
| In all the cases but the first one, the output will be a 1D array with a | |
| structured dtype. This dtype has as many fields as items in the sequence. | |
| The field names are defined with the :keyword:`names` keyword. | |
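The different spellings listed above can be compared side by side. This is only a sketch on a made-up three-column sample; the field names ``a``, ``b``, ``c`` and ``A``, ``B``, ``C`` are ours:

```python
from io import StringIO

import numpy as np

text = "1 2.5 3\n4 5.5 6"

# A sequence of types: fields receive the default names f0, f1, f2.
by_sequence = np.genfromtxt(StringIO(text), dtype=(int, float, int))

# A comma-separated string of type codes.
by_string = np.genfromtxt(StringIO(text), dtype="i4,f8,i4")

# A dictionary with 'names' and 'formats' keys.
by_dict = np.genfromtxt(
    StringIO(text),
    dtype={"names": ("a", "b", "c"), "formats": (int, float, int)},
)

# A sequence of (name, type) tuples.
by_tuples = np.genfromtxt(
    StringIO(text), dtype=[("A", int), ("B", float), ("C", int)]
)

for result in (by_sequence, by_string, by_dict, by_tuples):
    print(result.shape, result.dtype.names)
```

In each case the result is a 1D array with one record per line of input.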
| When ``dtype=None``, the type of each column is determined iteratively from | |
| its data. We start by checking whether a string can be converted to a | |
| boolean (that is, if the string matches ``true`` or ``false`` in lower | |
| case); then whether it can be converted to an integer, then to a float, | |
| then to a complex and eventually to a string. This behavior may be changed | |
| by modifying the default mapper of the | |
| :class:`~numpy.lib._iotools.StringConverter` class. | |
| The option ``dtype=None`` is provided for convenience. However, it is | |
| significantly slower than setting the dtype explicitly. | |
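A small sketch of this detection on a sample of our own with one integer, one float, one boolean and one string column. We pass ``encoding="utf-8"`` so that, on recent NumPy versions, the string column comes back as text; that choice is ours and not part of the original guide:

```python
from io import StringIO

import numpy as np

text = "1 2.5 true abc\n4 5.5 false def"

# Let genfromtxt work out the type of each column from its contents.
auto = np.genfromtxt(StringIO(text), dtype=None, encoding="utf-8")

# One dtype kind per column: i = integer, f = float, b = boolean, U = text.
kinds = [auto.dtype[name].kind for name in auto.dtype.names]
print(auto.dtype.names, kinds)
```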
| Setting the names | |
| ================= | |
| The :keyword:`names` argument | |
| ----------------------------- | |
| A natural approach when dealing with tabular data is to allocate a name to | |
| each column. A first possibility is to use an explicit structured dtype, | |
| as mentioned previously:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, dtype=[(_, int) for _ in "abc"]) | |
| array([(1, 2, 3), (4, 5, 6)], | |
| dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')]) | |
| Another simpler possibility is to use the :keyword:`names` keyword with a | |
| sequence of strings or a comma-separated string:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, names="A, B, C") | |
| array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], | |
| dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')]) | |
| In the example above, we used the fact that by default, ``dtype=float``. | |
| By giving a sequence of names, we are forcing the output to a structured | |
| dtype. | |
| We may sometimes need to define the column names from the data itself. In | |
| that case, we must use the :keyword:`names` keyword with a value of | |
| ``True``. The names will then be read from the first line (after the | |
| ``skip_header`` ones), even if the line is commented out:: | |
| >>> data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, skip_header=1, names=True) | |
| array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], | |
| dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')]) | |
| The default value of :keyword:`names` is ``None``. If we give any other | |
| value to the keyword, the new names will overwrite the field names we may | |
| have defined with the dtype:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> ndtype=[('a',int), ('b', float), ('c', int)] | |
| >>> names = ["A", "B", "C"] | |
| >>> np.genfromtxt(data, names=names, dtype=ndtype) | |
| array([(1, 2.0, 3), (4, 5.0, 6)], | |
| dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')]) | |
| The :keyword:`defaultfmt` argument | |
| ---------------------------------- | |
| If ``names=None`` but a structured dtype is expected, names are defined | |
| with the standard NumPy default of ``"f%i"``, yielding names like ``f0``, | |
| ``f1`` and so forth:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, dtype=(int, float, int)) | |
| array([(1, 2.0, 3), (4, 5.0, 6)], | |
| dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')]) | |
| In the same way, if we don't give enough names to match the length of the | |
| dtype, the missing names will be defined with this default template:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, dtype=(int, float, int), names="a") | |
| array([(1, 2.0, 3), (4, 5.0, 6)], | |
| dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')]) | |
| We can override this default with the :keyword:`defaultfmt` argument, which | |
| takes any format string:: | |
| >>> data = StringIO("1 2 3\n 4 5 6") | |
| >>> np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i") | |
| array([(1, 2.0, 3), (4, 5.0, 6)], | |
| dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')]) | |
| .. note:: | |
| We need to keep in mind that ``defaultfmt`` is used only if some names | |
| are expected but not defined. | |
| Validating names | |
| ---------------- | |
| NumPy arrays with a structured dtype can also be viewed as | |
| :class:`~numpy.recarray`, where a field can be accessed as if it were an | |
| attribute. For that reason, we may need to make sure that the field name | |
| doesn't contain any space or invalid character, or that it does not | |
| correspond to the name of a standard attribute (like ``size`` or | |
| ``shape``), which would confuse the interpreter. :func:`~numpy.genfromtxt` | |
| accepts three optional arguments that provide finer control over the names: | |
| :keyword:`deletechars` | |
| Gives a string combining all the characters that must be deleted from | |
| the name. By default, invalid characters are | |
| ``~!@#$%^&*()-=+~\|]}[{';: /?.>,<``. | |
| :keyword:`excludelist` | |
| Gives a list of the names to exclude, such as ``return``, ``file``, | |
| ``print``... If one of the input names is part of this list, an | |
| underscore character (``'_'``) will be appended to it. | |
| :keyword:`case_sensitive` | |
| Whether the names should be case-sensitive (``case_sensitive=True``), | |
| converted to upper case (``case_sensitive=False`` or | |
| ``case_sensitive='upper'``) or to lower case | |
| (``case_sensitive='lower'``). | |
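The effect of these three arguments can be sketched as follows. The column names are invented for the illustration; the first call relies on the default ``deletechars`` and ``excludelist``, the second supplies its own ``excludelist``:

```python
from io import StringIO

import numpy as np

text = "1 2 3\n4 5 6"

# '-' is deleted, 'return' is in the default exclusion list, and
# case_sensitive='lower' converts the remaining name to lower case.
lowered = np.genfromtxt(
    StringIO(text), names="my-col, return, Total", case_sensitive="lower"
)
print(lowered.dtype.names)  # ('mycol', 'return_', 'total')

# Names that would shadow array attributes can be listed explicitly.
guarded = np.genfromtxt(
    StringIO(text), names="size, shape, n", excludelist=["size", "shape"]
)
print(guarded.dtype.names)  # ('size_', 'shape_', 'n')
```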
| Tweaking the conversion | |
| ======================= | |
| The :keyword:`converters` argument | |
| ---------------------------------- | |
| Usually, defining a dtype is sufficient to define how the sequence of | |
| strings must be converted. However, some additional control may sometimes | |
| be required. For example, we may want to make sure that a date in a format | |
| ``YYYY/MM/DD`` is converted to a :class:`datetime` object, or that a string | |
| like ``xx%`` is properly converted to a float between 0 and 1. In such | |
| cases, we should define conversion functions with the :keyword:`converters` | |
| argument. | |
| The value of this argument is typically a dictionary with column indices or | |
| column names as keys and conversion functions as values. These | |
| conversion functions can either be actual functions or lambda functions. In | |
| any case, they should accept only a string as input and output only a | |
| single element of the wanted type. | |
| In the following example, the second column is converted from a string | |
| representing a percentage to a float between 0 and 1:: | |
| >>> convertfunc = lambda x: float(x.strip("%"))/100. | |
| >>> data = "1, 2.3%, 45.\n6, 78.9%, 0" | |
| >>> names = ("i", "p", "n") | |
| >>> # General case ..... | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", names=names) | |
| array([(1.0, nan, 45.0), (6.0, nan, 0.0)], | |
| dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')]) | |
| We need to keep in mind that by default, ``dtype=float``. A float is | |
| therefore expected for the second column. However, the strings ``' 2.3%'`` | |
| and ``' 78.9%'`` cannot be converted to float and we end up having | |
| ``np.nan`` instead. Let's now use a converter:: | |
| >>> # Converted case ... | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", names=names, | |
| ... converters={1: convertfunc}) | |
| array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)], | |
| dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')]) | |
| The same results can be obtained by using the name of the second column | |
| (``"p"``) as key instead of its index (1):: | |
| >>> # Using a name for the converter ... | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", names=names, | |
| ... converters={"p": convertfunc}) | |
| array([(1.0, 0.023, 45.0), (6.0, 0.78900000000000003, 0.0)], | |
| dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')]) | |
| Converters can also be used to provide a default for missing entries. In | |
| the following example, the converter ``convert`` transforms a stripped | |
| string into the corresponding float or into -999 if the string is empty. | |
| We need to explicitly strip the string from white spaces as it is not done | |
| by default:: | |
| >>> data = "1, , 3\n 4, 5, 6" | |
| >>> convert = lambda x: float(x.strip() or -999) | |
| >>> np.genfromtxt(StringIO(data), delimiter=",", | |
| ... converters={1: convert}) | |
| array([[ 1., -999., 3.], | |
| [ 4., 5., 6.]]) | |
| Using missing and filling values | |
| -------------------------------- | |
| Some entries may be missing in the dataset we are trying to import. In a | |
| previous example, we used a converter to transform an empty string into a | |
| float. However, user-defined converters may rapidly become cumbersome to | |
| manage. | |
| The :func:`~numpy.genfromtxt` function provides two other complementary | |
| mechanisms: the :keyword:`missing_values` argument is used to recognize | |
| missing data and a second argument, :keyword:`filling_values`, is used to | |
| process these missing data. | |
| :keyword:`missing_values` | |
| ------------------------- | |
| By default, any empty string is marked as missing. We can also consider | |
| more complex strings, such as ``"N/A"`` or ``"???"`` to represent missing | |
| or invalid data. The :keyword:`missing_values` argument accepts three kinds | |
| of values: | |
| a string or a comma-separated string | |
| This string will be used as the marker for missing data for all the | |
| columns. | |
| a sequence of strings | |
| In that case, each item is associated with a column, in order. | |
| a dictionary | |
| Values of the dictionary are strings or sequences of strings. The | |
| corresponding keys can be column indices (integers) or column names | |
| (strings). In addition, the special key ``None`` can be used to | |
| define a default applicable to all columns. | |
| :keyword:`filling_values` | |
| ------------------------- | |
| We know how to recognize missing data, but we still need to provide a value | |
| for these missing entries. By default, this value is determined from the | |
| expected dtype according to this table: | |
| ============= ============== | |
| Expected type Default | |
| ============= ============== | |
| ``bool`` ``False`` | |
| ``int`` ``-1`` | |
| ``float`` ``np.nan`` | |
| ``complex`` ``np.nan+0j`` | |
| ``string`` ``'???'`` | |
| ============= ============== | |
| We can get a finer control on the conversion of missing values with the | |
| :keyword:`filling_values` optional argument. Like | |
| :keyword:`missing_values`, this argument accepts different kinds of values: | |
| a single value | |
| This will be the default for all columns. | |
| a sequence of values | |
| Each entry will be the default for the corresponding column. | |
| a dictionary | |
| Each key can be a column index or a column name, and the | |
| corresponding value should be a single object. We can use the | |
| special key ``None`` to define a default for all columns. | |
| In the following example, we suppose that the missing values are flagged | |
| with ``"N/A"`` in the first column and by ``"???"`` in the third column. | |
| We wish to transform these missing values to 0 if they occur in the first | |
| and second column, and to -999 if they occur in the last column:: | |
| >>> data = "N/A, 2, 3\n4, ,???" | |
| >>> kwargs = dict(delimiter=",", | |
| ... dtype=int, | |
| ... names="a,b,c", | |
| ... missing_values={0:"N/A", 'b':" ", 2:"???"}, | |
| ... filling_values={0:0, 'b':0, 2:-999}) | |
| >>> np.genfromtxt(StringIO(data), **kwargs) | |
| array([(0, 2, 3), (4, 0, -999)], | |
| dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')]) | |
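The simpler forms of :keyword:`filling_values` (a single value, or one value per column) can be sketched in the same way. The sample data is ours and relies only on the default rule that an empty string is missing:

```python
from io import StringIO

import numpy as np

# One empty field in each row.
text = "1,,3\n,5,6"

# A single value is used as the fill for every column.
same_fill = np.genfromtxt(StringIO(text), delimiter=",", filling_values=0)

# A sequence gives one fill value per column, in order.
per_column = np.genfromtxt(
    StringIO(text), delimiter=",", filling_values=(-1, -2, -3)
)

print(same_fill)
print(per_column)
```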
| :keyword:`usemask` | |
| ------------------ | |
| We may also want to keep track of the occurrence of missing data by | |
| constructing a boolean mask, with ``True`` entries where data was missing | |
| and ``False`` otherwise. To do that, we just have to set the optional | |
| argument :keyword:`usemask` to ``True`` (the default is ``False``). The | |
| output array will then be a :class:`~numpy.ma.MaskedArray`. | |
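As a minimal sketch, assuming missing entries are either empty or flagged with ``"N/A"`` (the sample data is ours):

```python
from io import StringIO

import numpy as np

text = "1,,3\n4,5,N/A"

masked = np.genfromtxt(
    StringIO(text), delimiter=",", missing_values="N/A", usemask=True
)

# The mask is True wherever an entry was recognized as missing.
print(masked.mask)
# The masked entries can later be replaced with any value we choose.
print(masked.filled(0))
```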
| .. unpack=None, loose=True, invalid_raise=True) | |
| Shortcut functions | |
| ================== | |
| In addition to :func:`~numpy.genfromtxt`, the :mod:`numpy.lib.npyio` module | |
| provides several convenience functions derived from | |
| :func:`~numpy.genfromtxt`. These functions work the same way as the | |
| original, but they have different default values. | |
| :func:`~numpy.ndfromtxt` | |
| Always set ``usemask=False``. | |
| The output is always a standard :class:`numpy.ndarray`. | |
| :func:`~numpy.mafromtxt` | |
| Always set ``usemask=True``. | |
| The output is always a :class:`~numpy.ma.MaskedArray`. | |
| :func:`~numpy.recfromtxt` | |
| Returns a standard :class:`numpy.recarray` (if ``usemask=False``) or a | |
| :class:`~numpy.ma.mrecords.MaskedRecords` array (if ``usemask=True``). The | |
| default dtype is ``dtype=None``, meaning that the types of each column | |
| will be automatically determined. | |
| :func:`~numpy.recfromcsv` | |
| Like :func:`~numpy.recfromtxt`, but with a default ``delimiter=","``. | |