Spaces:
Running
Running
| doctr.io | |
| ======== | |
| .. currentmodule:: doctr.io | |
| The io module enables users to easily access content from documents and export analysis | |
| results to structured formats. | |
| .. _document_structure: | |
| Document structure | |
| ------------------ | |
| Structural organization of the documents. | |
| Word | |
| ^^^^ | |
| A Word is an uninterrupted sequence of characters. | |
| .. autoclass:: Word | |
| Line | |
| ^^^^ | |
| A Line is a collection of Words aligned spatially and meant to be read together (on a two-column page, on the same horizontal, we will consider that there are two Lines). | |
| .. autoclass:: Line | |
| Artefact | |
| ^^^^^^^^ | |
| An Artefact is a non-textual element (e.g. QR code, picture, chart, signature, logo, etc.). | |
| .. autoclass:: Artefact | |
| Block | |
| ^^^^^ | |
| A Block is a collection of Lines (e.g. an address written on several lines) and Artefacts (e.g. a graph with its title underneath). | |
| .. autoclass:: Block | |
| Page | |
| ^^^^ | |
| A Page is a collection of Blocks that were on the same physical page. | |
| .. autoclass:: Page | |
| .. automethod:: show | |
| Document | |
| ^^^^^^^^ | |
| A Document is a collection of Pages. | |
| .. autoclass:: Document | |
| .. automethod:: show | |
| File reading | |
| ------------ | |
| High-performance file reading and conversion to processable structured data. | |
| .. autofunction:: read_pdf | |
| .. autofunction:: read_img_as_numpy | |
| .. autofunction:: read_img_as_tensor | |
| .. autofunction:: decode_img_as_tensor | |
| .. autofunction:: read_html | |
| .. autoclass:: DocumentFile | |
| .. automethod:: from_pdf | |
| .. automethod:: from_url | |
| .. automethod:: from_images | |