Commit 0df139b · committed by Vik Paruchuri · 1 parent: f960d11
Bump version
- README.md +10 -5
- poetry.lock +5 -5
- pyproject.toml +1 -1
- tests/renderers/test_chunk_renderer.py +0 -1
README.md
CHANGED
@@ -3,12 +3,12 @@
 Marker converts documents to markdown, JSON, and HTML quickly and accurately.
 
 - Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
-- Does structured extraction, given a JSON schema (beta)
 - Formats tables, forms, equations, inline math, links, references, and code blocks
 - Extracts and saves images
 - Removes headers/footers/other artifacts
 - Extensible with your own formatting and logic
--
+- Does structured extraction, given a JSON schema (beta)
+- Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS
 
 ## Performance
@@ -17,7 +17,7 @@ Marker converts documents to markdown, JSON, and HTML quickly and accurately.
 
 Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
 
-The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of
+The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 25 pages/second on an H100.
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
@@ -61,7 +61,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 
 # Installation
 
-You'll need python 3.10+ and PyTorch
+You'll need python 3.10+ and [PyTorch](https://pytorch.org/get-started/locally/).
 
 Install with:
 
@@ -81,6 +81,7 @@ First, some configuration:
 
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
 - Some PDFs, even digital ones, have bad text in them. Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+- If you care about inline math, set `format_lines` to automatically convert inline math to LaTeX.
 
 ## Interactive App
 
@@ -101,7 +102,7 @@ You can pass in PDFs or images.
 
 Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
-- `--output_format [markdown|json|html]`: Specify the format for the output results.
+- `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
@@ -345,6 +346,10 @@ Note that child blocks of pages can have their own children as well (a tree stru
 
 ```
 
+## Chunks
+
+Chunks format is similar to JSON, but flattens everything into a single list instead of a tree. Only the top level blocks from each page show up. It also has the full HTML of each block inside, so you don't need to crawl the tree to reconstruct it.
+
 ## Metadata
 
 All output formats will return a metadata dictionary, with the following fields:
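The `--page_range` and `--paginate_output` descriptions in the README hunk above can be made concrete with a small sketch. This is not marker's own implementation: `parse_page_range` and `page_separator` are hypothetical helper names, and the handling of whitespace, duplicate pages, and reversed ranges is an assumption. Only the range syntax and the separator layout come from the README text.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers.

    Hypothetical helper: it mirrors the syntax the README describes
    (comma-separated page numbers and inclusive ranges), nothing more.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: ignore empty items such as a trailing comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"reversed range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def page_separator(page_number: int) -> str:
    """Build the separator the README describes for --paginate_output:
    two newlines, the page number, 48 dashes, then two newlines."""
    return f"\n\n{page_number}" + "-" * 48 + "\n\n"


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
print(repr(page_separator(3)))
```

Returning a sorted, de-duplicated list is a design choice of this sketch; the README does not say how overlapping ranges are treated.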
poetry.lock
CHANGED
@@ -3366,10 +3366,10 @@ files = [
 
 [package.dependencies]
 numpy = [
+    {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\""},
+    {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\""},
+    {version = ">=1.23.5", markers = "python_version >= \"3.11\""},
     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-    {version = ">=1.23.5", markers = "python_version == \"3.11\""},
-    {version = ">=1.21.4", markers = "python_version == \"3.10\" and platform_system == \"Darwin\""},
-    {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version == \"3.10\""},
 ]
 
 [[package]]
@@ -3466,9 +3466,9 @@ files = [
 
 [package.dependencies]
 numpy = [
-    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-    {version = ">=1.23.2", markers = "python_version == \"3.11\""},
     {version = ">=1.22.4", markers = "python_version < \"3.11\""},
+    {version = ">=1.23.2", markers = "python_version == \"3.11\""},
+    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
 python-dateutil = ">=2.8.2"
 pytz = ">=2020.1"
pyproject.toml
CHANGED
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.
+version = "1.8.0"
 description = "Convert documents to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"
tests/renderers/test_chunk_renderer.py
CHANGED
@@ -7,7 +7,6 @@ from marker.renderers.chunk import ChunkRenderer
 def test_markdown_renderer_pagination(pdf_document):
     renderer = ChunkRenderer()
     blocks = renderer(pdf_document).blocks
-    breakpoint()
 
     assert len(blocks) == 15
     assert blocks[0].block_type == "SectionHeader"
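The test above inspects the flat `blocks` list that `ChunkRenderer` returns, and the README change in this commit describes the chunks format as a flattened version of the JSON tree that keeps only each page's top-level blocks. A minimal sketch of that flattening on plain dictionaries follows. It is illustrative only and does not use marker's classes: the `flatten_top_level` helper, the `page` key, and the exact shape of the sample tree (`children`, `html`) are assumptions made for the example.

```python
def flatten_top_level(document: dict) -> list[dict]:
    """Collect the top-level blocks of every page into one flat list.

    Hypothetical helper: nested children are not visited, on the
    assumption that each top-level block already carries its full HTML.
    """
    chunks = []
    for page_index, page in enumerate(document.get("children") or []):
        for block in page.get("children") or []:
            chunks.append(
                {
                    "page": page_index,  # assumed key, for illustration
                    "block_type": block["block_type"],
                    "html": block["html"],
                }
            )
    return chunks


# A made-up two-page tree; real marker output has more fields.
document = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "SectionHeader", "html": "<h1>Title</h1>", "children": None},
                {
                    "block_type": "ListGroup",
                    "html": "<ul><li>a</li></ul>",
                    "children": [
                        {"block_type": "ListItem", "html": "<li>a</li>", "children": None}
                    ],
                },
            ],
        },
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "html": "<p>Body</p>", "children": None}
            ],
        },
    ],
}

chunks = flatten_top_level(document)
print([chunk["block_type"] for chunk in chunks])
```

The nested `ListItem` does not appear in the result because its parent `ListGroup` already contains its HTML, which is the property the README attributes to the chunks format.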