Vik Paruchuri committed on
Commit 0df139b · 1 Parent(s): f960d11

Bump version

README.md CHANGED
@@ -3,12 +3,12 @@
  Marker converts documents to markdown, JSON, and HTML quickly and accurately.

  - Converts PDF, image, PPTX, DOCX, XLSX, HTML, EPUB files in all languages
- - Does structured extraction, given a JSON schema (beta)
  - Formats tables, forms, equations, inline math, links, references, and code blocks
  - Extracts and saves images
  - Removes headers/footers/other artifacts
  - Extensible with your own formatting and logic
- - Optionally boost accuracy with LLMs
+ - Does structured extraction, given a JSON schema (beta)
+ - Optionally boost accuracy with LLMs (and your own prompt)
  - Works on GPU, CPU, or MPS

  ## Performance
@@ -17,7 +17,7 @@ Marker converts documents to markdown, JSON, and HTML quickly and accurately.

  Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.

- The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 122 pages/second on an H100 (.18 seconds per page across 22 processes).
+ The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 25 pages/second on an H100.

  See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

@@ -61,7 +61,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):

  # Installation

- You'll need python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
+ You'll need python 3.10+ and [PyTorch](https://pytorch.org/get-started/locally/).

  Install with:

@@ -81,6 +81,7 @@ First, some configuration:

  - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
  - Some PDFs, even digital ones, have bad text in them. Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+ - If you care about inline math, set `format_lines` to automatically convert inline math to LaTeX.

  ## Interactive App

@@ -101,7 +102,7 @@ You can pass in PDFs or images.

  Options:
  - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
- - `--output_format [markdown|json|html]`: Specify the format for the output results.
+ - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
  - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
  - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
  - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
@@ -345,6 +346,10 @@ Note that child blocks of pages can have their own children as well (a tree stru

  ```

+ ## Chunks
+
+ Chunks format is similar to JSON, but flattens everything into a single list instead of a tree. Only the top level blocks from each page show up. It also has the full HTML of each block inside, so you don't need to crawl the tree to reconstruct it.
+
  ## Metadata

  All output formats will return a metadata dictionary, with the following fields:
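
Note on the `--page_range` option shown in the README diff: the syntax it documents (comma-separated page numbers and inclusive ranges, as in `"0,5-10,20"`) can be illustrated with a short sketch. The function name `parse_page_range` is hypothetical and exists only for this illustration; it is not marker's own parser, which may validate or order pages differently.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices.

    Hypothetical helper for illustration only; not marker's implementation.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges include both ends
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_range("0,5-10,20"))
    # [0, 5, 6, 7, 8, 9, 10, 20]
```

The result matches the README's own example: page 0, pages 5 through 10, and page 20.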
poetry.lock CHANGED
@@ -3366,10 +3366,10 @@ files = [

  [package.dependencies]
  numpy = [
+     {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\""},
+     {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\""},
+     {version = ">=1.23.5", markers = "python_version >= \"3.11\""},
      {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-     {version = ">=1.23.5", markers = "python_version == \"3.11\""},
-     {version = ">=1.21.4", markers = "python_version == \"3.10\" and platform_system == \"Darwin\""},
-     {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version == \"3.10\""},
  ]

  [[package]]
@@ -3466,9 +3466,9 @@ files = [

  [package.dependencies]
  numpy = [
-     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
-     {version = ">=1.23.2", markers = "python_version == \"3.11\""},
      {version = ">=1.22.4", markers = "python_version < \"3.11\""},
+     {version = ">=1.23.2", markers = "python_version == \"3.11\""},
+     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
  ]
  python-dateutil = ">=2.8.2"
  pytz = ">=2020.1"
pyproject.toml CHANGED
@@ -1,6 +1,6 @@
  [tool.poetry]
  name = "marker-pdf"
- version = "1.7.5"
+ version = "1.8.0"
  description = "Convert documents to markdown with high speed and accuracy."
  authors = ["Vik Paruchuri <github@vikas.sh>"]
  readme = "README.md"
tests/renderers/test_chunk_renderer.py CHANGED
@@ -7,7 +7,6 @@ from marker.renderers.chunk import ChunkRenderer
  def test_markdown_renderer_pagination(pdf_document):
      renderer = ChunkRenderer()
      blocks = renderer(pdf_document).blocks
-     breakpoint()

      assert len(blocks) == 15
      assert blocks[0].block_type == "SectionHeader"
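
Note on the chunks format: the README text added in this commit describes it as a single flat list holding only the top-level blocks of each page, each carrying its full HTML, and the test above reads that list through `.blocks`. The following is a minimal sketch of that flattening over plain dictionaries. The function name `flatten_top_level`, the `page` key, and the sample ids are hypothetical, and the field names `children`, `id`, `block_type`, and `html` are assumptions about the JSON tree. This is not the `ChunkRenderer` implementation, and it assumes each top-level block's `html` is already complete.

```python
from typing import Any


def flatten_top_level(pages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect only the direct children of each page into one flat list.

    Illustrative sketch; field names are assumptions, not marker's schema.
    """
    chunks: list[dict[str, Any]] = []
    for page_index, page in enumerate(pages):
        # Nested descendants are deliberately skipped: only top-level blocks appear.
        for block in page.get("children") or []:
            chunks.append(
                {
                    "id": block["id"],
                    "block_type": block["block_type"],
                    "html": block["html"],
                    "page": page_index,  # hypothetical key added for the example
                }
            )
    return chunks


if __name__ == "__main__":
    # Made-up sample tree: two pages, one block with a nested child.
    tree = [
        {
            "id": "page-0",
            "children": [
                {"id": "p0-b0", "block_type": "SectionHeader",
                 "html": "<h1>Title</h1>", "children": None},
                {"id": "p0-b1", "block_type": "ListGroup",
                 "html": "<ul><li>a</li></ul>",
                 "children": [{"id": "p0-b2", "block_type": "ListItem",
                               "html": "<li>a</li>", "children": None}]},
            ],
        },
        {
            "id": "page-1",
            "children": [
                {"id": "p1-b0", "block_type": "Text",
                 "html": "<p>Body</p>", "children": None},
            ],
        },
    ]
    for chunk in flatten_top_level(tree):
        print(chunk["page"], chunk["block_type"])
```

In the sample, the nested `ListItem` does not appear in the output because only each page's direct children are kept.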