Vik Paruchuri committed on Jun 25

Commit 0df139b

1 Parent(s): f960d11

Bump version

Files changed (4)

README.md +10 -5
poetry.lock +5 -5
pyproject.toml +1 -1
tests/renderers/test_chunk_renderer.py +0 -1

README.md CHANGED

@@ -3,12 +3,12 @@
 Marker converts documents to markdown, JSON, and HTML quickly and accurately.
 - Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
-- Does structured extraction, given a JSON schema (beta)
 - Formats tables, forms, equations, inline math, links, references, and code blocks
 - Extracts and saves images
 - Removes headers/footers/other artifacts
 - Extensible with your own formatting and logic
-- Optionally boost accuracy with LLMs
 - Works on GPU, CPU, or MPS
 ## Performance
@@ -17,7 +17,7 @@ Marker converts documents to markdown, JSON, and HTML quickly and accurately.
 Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
-The above results are running single PDF pages serially.  Marker is significantly faster when running in batch mode, with a projected throughput of 122 pages/second on an H100 (.18 seconds per page across 22 processes).
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
@@ -61,7 +61,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 # Installation
-You'll need python 3.10+ and PyTorch.  You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine.  See [here](https://pytorch.org/get-started/locally/) for more details.
 Install with:
@@ -81,6 +81,7 @@ First, some configuration:
 - Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
 - Some PDFs, even digital ones, have bad text in them.  Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
 ## Interactive App
@@ -101,7 +102,7 @@ You can pass in PDFs or images.
 Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
-- `--output_format [markdown|json|html]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy.  You will need to configure the LLM backend - see [below](#llm-services).
@@ -345,6 +346,10 @@ Note that child blocks of pages can have their own children as well (a tree stru
 ```
 ## Metadata
 All output formats will return a metadata dictionary, with the following fields:

 Marker converts documents to markdown, JSON, and HTML quickly and accurately.
 - Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
 - Formats tables, forms, equations, inline math, links, references, and code blocks
 - Extracts and saves images
 - Removes headers/footers/other artifacts
 - Extensible with your own formatting and logic
+- Does structured extraction, given a JSON schema (beta)
+- Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS
 ## Performance
 Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
+The above results are running single PDF pages serially.  Marker is significantly faster when running in batch mode, with a projected throughput of 25 pages/second on an H100.
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 # Installation
+You'll need python 3.10+ and [PyTorch](https://pytorch.org/get-started/locally/).
 Install with:
 - Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
 - Some PDFs, even digital ones, have bad text in them.  Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+- If you care about inline math, set `format_lines` to automatically convert inline math to LaTeX.
 ## Interactive App
 Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
+- `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy.  You will need to configure the LLM backend - see [below](#llm-services).
 ```
+## Chunks
+Chunks format is similar to JSON, but flattens everything into a single list instead of a tree.  Only the top level blocks from each page show up. It also has the full HTML of each block inside, so you don't need to crawl the tree to reconstruct it.
 ## Metadata
 All output formats will return a metadata dictionary, with the following fields:
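
The flattening described in the new Chunks section can be sketched as follows. This is a minimal illustration, not marker's `ChunkRenderer`: the input dict shape and the `flatten_top_level` helper are hypothetical, and the field names `children`, `block_type`, and `html` are assumptions made for the example.

```python
from typing import Any


def flatten_top_level(document: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect each page's top-level blocks into one flat list.

    Hypothetical helper: assumes a JSON-like tree in which the document's
    children are pages, and every block carries its own rendered "html".
    """
    chunks: list[dict[str, Any]] = []
    for page_index, page in enumerate(document.get("children") or []):
        for block in page.get("children") or []:
            # Nested children are deliberately not visited: the block's own
            # html is assumed to already contain their rendered content.
            chunks.append(
                {
                    "page": page_index,
                    "block_type": block.get("block_type"),
                    "html": block.get("html", ""),
                }
            )
    return chunks


example = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "SectionHeader", "html": "<h1>Title</h1>", "children": None},
                {
                    "block_type": "ListGroup",
                    "html": "<ul><li>a</li></ul>",
                    "children": [{"block_type": "ListItem", "html": "<li>a</li>"}],
                },
            ],
        },
        {"block_type": "Page", "children": [{"block_type": "Text", "html": "<p>b</p>"}]},
    ],
}

chunks = flatten_top_level(example)
print([c["block_type"] for c in chunks])
```

In this sketch the nested `ListItem` never appears as its own chunk. It is reachable only through the HTML of its parent `ListGroup`, which mirrors the "no need to crawl the tree" property described above.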

poetry.lock CHANGED

@@ -3366,10 +3366,10 @@ files = [
 [package.dependencies]
 numpy = [
     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-    {version = ">=1.23.5", markers = "python_version == \"3.11\""},
-    {version = ">=1.21.4", markers = "python_version == \"3.10\" and platform_system == \"Darwin\""},
-    {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version == \"3.10\""},
 ]
 [[package]]
@@ -3466,9 +3466,9 @@ files = [
 [package.dependencies]
 numpy = [
-    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-    {version = ">=1.23.2", markers = "python_version == \"3.11\""},
     {version = ">=1.22.4", markers = "python_version < \"3.11\""},
 ]
 python-dateutil = ">=2.8.2"
 pytz = ">=2020.1"

 [package.dependencies]
 numpy = [
+    {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\""},
+    {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\""},
+    {version = ">=1.23.5", markers = "python_version >= \"3.11\""},
     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
 [[package]]
 [package.dependencies]
 numpy = [
     {version = ">=1.22.4", markers = "python_version < \"3.11\""},
+    {version = ">=1.23.2", markers = "python_version == \"3.11\""},
+    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
 python-dateutil = ">=2.8.2"
 pytz = ">=2020.1"

pyproject.toml CHANGED

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "1.7.5"
 description = "Convert documents to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"

 [tool.poetry]
 name = "marker-pdf"
+version = "1.8.0"
 description = "Convert documents to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"

tests/renderers/test_chunk_renderer.py CHANGED

@@ -7,7 +7,6 @@ from marker.renderers.chunk import ChunkRenderer
 def test_markdown_renderer_pagination(pdf_document):
     renderer = ChunkRenderer()
     blocks = renderer(pdf_document).blocks
-    breakpoint()
     assert len(blocks) == 15
     assert blocks[0].block_type == "SectionHeader"

 def test_markdown_renderer_pagination(pdf_document):
     renderer = ChunkRenderer()
     blocks = renderer(pdf_document).blocks
     assert len(blocks) == 15
     assert blocks[0].block_type == "SectionHeader"
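
Leftover debugger calls like the `breakpoint()` removed from this test can be caught mechanically. Below is a minimal sketch using only the standard library `ast` module. The `find_breakpoints` helper is hypothetical and is not part of marker's test suite.

```python
import ast


def find_breakpoints(source: str) -> list[int]:
    """Return the line numbers of bare `breakpoint()` calls in Python source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "breakpoint"
        ):
            lines.append(node.lineno)
    return sorted(lines)


# The test body before and after this commit (parsed only, never executed).
before = '''
def test_markdown_renderer_pagination(pdf_document):
    renderer = ChunkRenderer()
    blocks = renderer(pdf_document).blocks
    breakpoint()
    assert len(blocks) == 15
'''
after = before.replace("    breakpoint()\n", "")

print(find_breakpoints(before))
print(find_breakpoints(after))
```

A check along these lines could run as a pre-commit hook or CI step, so that a stray debugger call is flagged before it reaches a release commit.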