| Metadata-Version: 2.1 |
| Name: identify |
| Version: 2.5.21 |
| Summary: File identification library for Python |
| Home-page: https://github.com/pre-commit/identify |
| Author: Chris Kuehl |
| Author-email: ckuehl@ocf.berkeley.edu |
| License: MIT |
| Classifier: License :: OSI Approved :: MIT License |
| Classifier: Programming Language :: Python :: 3 |
| Classifier: Programming Language :: Python :: 3 :: Only |
| Classifier: Programming Language :: Python :: Implementation :: CPython |
| Classifier: Programming Language :: Python :: Implementation :: PyPy |
| Requires-Python: >=3.7 |
| Description-Content-Type: text/markdown |
| License-File: LICENSE |
| Provides-Extra: license |
| Requires-Dist: ukkonen ; extra == 'license' |
|
|
|
|
| identify |
| ======== |
|
|
| File identification library for Python. |
|
|
| Given a file (or some information about a file), return a set of standardized |
| tags identifying what the file is. |
|
|
| ## Installation |
|
|
| ```bash |
| pip install identify |
| ``` |
|
|
| ## Usage |
| ### With a file on disk |
|
|
| If you have an actual file on disk, you can get the most information possible |
| (a superset of all other methods): |
|
|
| ```python |
| >>> from identify import identify |
| >>> identify.tags_from_path('/path/to/file.py') |
| {'file', 'text', 'python', 'non-executable'} |
| >>> identify.tags_from_path('/path/to/file-with-shebang') |
| {'file', 'text', 'shell', 'bash', 'executable'} |
| >>> identify.tags_from_path('/bin/bash') |
| {'file', 'binary', 'executable'} |
| >>> identify.tags_from_path('/path/to/directory') |
| {'directory'} |
| >>> identify.tags_from_path('/path/to/symlink') |
| {'symlink'} |
| ``` |
|
|
| When using a file on disk, the checks performed are: |
|
|
| * File type (file, symlink, directory, socket) |
| * Mode (is it executable?) |
| * File name (mostly based on extension) |
| * If executable, the shebang is read and the interpreter is identified |
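The first two checks need only file metadata, not file contents. The following is a minimal sketch of that idea using the standard library; `type_and_mode_tags` is a hypothetical helper written for illustration and is not part of `identify`.

```python
import os
import stat


def type_and_mode_tags(path):
    """Return coarse tags from file metadata only (hypothetical helper)."""
    # lstat() does not follow symlinks, so a symlink is reported as a symlink.
    mode = os.lstat(path).st_mode
    if stat.S_ISDIR(mode):
        return {'directory'}
    if stat.S_ISLNK(mode):
        return {'symlink'}
    if stat.S_ISSOCK(mode):
        return {'socket'}
    tags = {'file'}
    # os.access() asks whether the current user may execute the file.
    if os.access(path, os.X_OK):
        tags.add('executable')
    else:
        tags.add('non-executable')
    return tags
```

The tag names mirror the ones shown in the examples above; whether the library performs the checks in exactly this way is an assumption.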
|
|
|
|
| ### If you only have the filename |
|
|
| ```python |
| >>> identify.tags_from_filename('file.py') |
| {'text', 'python'} |
| ``` |
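A filename-only lookup needs no disk access at all. A rough sketch of how such a lookup could work is shown here; the two tables are tiny, hypothetical stand-ins (the tags for `png` and `Dockerfile` are illustrative assumptions), and `sketch_tags_from_filename` is not the library's function.

```python
import os

# Hypothetical, deliberately tiny lookup tables for illustration only.
EXTENSION_TAGS = {
    'py': {'text', 'python'},
    'png': {'binary', 'image', 'png'},
}
NAME_TAGS = {
    'Dockerfile': {'text', 'dockerfile'},
}


def sketch_tags_from_filename(path):
    """Look up tags by whole file name first, then by extension."""
    name = os.path.basename(path)
    if name in NAME_TAGS:
        return set(NAME_TAGS[name])
    # splitext() returns ('file', '.py'); drop the dot and ignore case.
    ext = os.path.splitext(name)[1].lstrip('.').lower()
    return set(EXTENSION_TAGS.get(ext, ()))
```

Note that the result carries no `file` or `executable` tag, since neither can be known from a name alone.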
|
|
|
|
| ### If you only have the interpreter |
|
|
| ```python |
| >>> identify.tags_from_interpreter('python3.5') |
| {'python', 'python3'} |
| >>> identify.tags_from_interpreter('bash') |
| {'shell', 'bash'} |
| >>> identify.tags_from_interpreter('some-unrecognized-thing') |
| set() |
| ``` |
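The `python3.5` result above suggests that a versioned interpreter name falls back to a less specific one. Below is a hedged sketch of one way to get that behaviour, together with a simplified shebang reader; both functions and the lookup table are hypothetical and only approximate what a real implementation would handle (for example, `env -S` is not covered).

```python
import os

# Hypothetical lookup table; only the entries needed for the examples.
INTERPRETER_TAGS = {
    'bash': {'shell', 'bash'},
    'python': {'python'},
    'python3': {'python', 'python3'},
}


def interpreter_from_shebang(first_line):
    """Return the interpreter name from a shebang line, or '' if none."""
    if not first_line.startswith('#!'):
        return ''
    parts = first_line[2:].split()
    if not parts:
        return ''
    # `#!/usr/bin/env python3` names the interpreter in the second word.
    if os.path.basename(parts[0]) == 'env' and len(parts) > 1:
        return os.path.basename(parts[1])
    return os.path.basename(parts[0])


def sketch_tags_from_interpreter(interpreter):
    """Try the full name, then strip trailing dotted version components."""
    name = os.path.basename(interpreter)
    while name:
        if name in INTERPRETER_TAGS:
            return set(INTERPRETER_TAGS[name])
        # 'python3.5' -> 'python3'; a name without a dot becomes ''.
        name, _, _ = name.rpartition('.')
    return set()
```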
|
|
| ### As a CLI |
|
|
| ```console |
| $ identify-cli --help |
| usage: identify-cli [-h] [--filename-only] path |
|
| positional arguments: |
|   path |
|
| optional arguments: |
|   -h, --help       show this help message and exit |
|   --filename-only |
| ``` |
|
|
| ```console |
| $ identify-cli setup.py; echo $? |
| ["file", "non-executable", "python", "text"] |
| 0 |
| $ identify-cli setup.py --filename-only; echo $? |
| ["python", "text"] |
| 0 |
| $ identify-cli wat.wat; echo $? |
| wat.wat does not exist. |
| 1 |
| $ identify-cli wat.wat --filename-only; echo $? |
| 1 |
| ``` |
|
|
| ### Identifying LICENSE files |
|
|
| `identify` also has an API for determining what type of license is contained |
| in a file. This routine is roughly based on the approaches used by |
| [licensee] (the Ruby gem that GitHub uses to figure out the license for a |
| repo). |
|
|
| The approach that `identify` uses is as follows: |
|
|
| 1. Strip the copyright line |
| 2. Normalize all whitespace |
| 3. Return any exact matches |
| 4. Return the closest by edit distance (where edit distance < 5%) |
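The four steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the copyright regex is a guess, the 5% threshold is taken relative to the length of the normalized input, and a plain dynamic-programming Levenshtein distance stands in for whatever edit-distance routine the `license` extra provides. All names here are hypothetical.

```python
import re

# Assumed pattern for a copyright line; real license files vary.
COPYRIGHT_RE = re.compile(r'^\s*(copyright|\(c\)) .*$', re.I | re.MULTILINE)
WS_RE = re.compile(r'\s+')


def normalize(text):
    text = COPYRIGHT_RE.sub('', text)      # step 1: strip the copyright line
    return WS_RE.sub(' ', text).strip()    # step 2: normalize whitespace


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def sketch_license_id(text, known):
    """`known` maps a license id to its reference text."""
    norm = normalize(text)
    if not norm:
        return None
    refs = {spdx: normalize(ref) for spdx, ref in known.items()}
    for spdx, ref in refs.items():         # step 3: exact matches
        if ref == norm:
            return spdx
    best_id, best = None, None
    for spdx, ref in refs.items():         # step 4: closest by edit distance
        dist = edit_distance(norm, ref)
        if best is None or dist < best:
            best_id, best = spdx, dist
    if best is not None and best / len(norm) < 0.05:
        return best_id
    return None
```

Because the copyright line is removed before comparison, two copies of the same license with different holders and years still count as an exact match.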
|
|
| To use the API, install with the `license` extra via `pip install identify[license]`. |
|
|
| ```pycon |
| >>> from identify import identify |
| >>> identify.license_id('LICENSE') |
| 'MIT' |
| ``` |
|
|
| The return value of the `license_id` function is an [SPDX] ID. Currently, |
| licenses are sourced from [choosealicense.com]. |
|
|
| [licensee]: https://github.com/benbalter/licensee |
| [SPDX]: https://spdx.org/licenses/ |
| [choosealicense.com]: https://github.com/github/choosealicense.com |
|
|
| ## How it works |
|
|
| A call to `tags_from_path` does this: |
|
|
| 1. What is the type: file, symlink, directory? If it's not a file, stop here. |
| 2. Is it executable? Add the appropriate tag. |
| 3. Do we recognize the file extension? If so, add the appropriate tags and |
|    stop here. These tags include binary/text. |
| 4. Peek at the first X bytes of the file. Use these to determine whether it is |
|    binary or text, and add the appropriate tag. |
| 5. If identified as text above, try to read and interpret the shebang, and add |
|    the appropriate tags. |
| |
| By design, this means we don't need to partially read files where we recognize |
| the file extension. |
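Step 4, the text-versus-binary peek, can be approximated by checking whether the first bytes contain anything outside a set of "texty" byte values. The sketch below is one common heuristic, offered as an assumption rather than the library's exact rule; `looks_like_text` and the `peek_size` default of 1024 are hypothetical choices, since the text above only says "the first X bytes".

```python
import io

# Bytes that commonly occur in text: a few control characters, printable
# ASCII, and all high bytes (to allow UTF-8 and similar encodings).
TEXT_BYTES = (
    bytes([7, 8, 9, 10, 11, 12, 13, 27])
    + bytes(range(0x20, 0x7f))
    + bytes(range(0x80, 0x100))
)


def looks_like_text(stream, peek_size=1024):
    """Guess text vs. binary from the start of a binary stream."""
    head = stream.read(peek_size)
    # Deleting every text byte leaves only suspicious bytes, if any remain.
    return not head.translate(None, TEXT_BYTES)
```

A caller would open the file in binary mode and pass the file object, for example `looks_like_text(open(path, 'rb'))`; an `io.BytesIO` works the same way for experiments.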
|
|