Commit f9e3dde · Vik Paruchuri · 1 parent: c7d0c93

Bugfixes and new features

Files changed:
- README.md +13 -4
- convert.py +16 -8
- marker/images/extract.py +5 -0
- marker/postprocessors/markdown.py +6 -1
- marker/settings.py +1 -0
- pyproject.toml +1 -1

README.md
CHANGED
@@ -38,16 +38,16 @@ The above results are with marker and nougat setup so they each take ~4GB of VRA
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+# Hosted API
+
+There is a hosted API for marker available [here](https://www.datalab.to/). It has been tuned for performance, and generally takes 10s + 1s/page for conversion.
+
 # Commercial usage
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
 
 The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
 
-# Hosted API
-
-There is a hosted API for marker available [here](https://www.datalab.to/). It's currently in beta, and I'm working on optimizing speed.
-
 # Community
 
 [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
@@ -147,6 +147,15 @@ There are some settings that you may find useful if things aren't working the wa
 
 In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.
 
+## Useful settings
+
+These settings can improve/change output quality:
+
+- `OCR_ALL_PAGES` will force OCR across the document. Many PDFs have bad text embedded due to older OCR engines being used.
+- `PAGINATE_OUTPUT` will put a horizontal rule between pages. Default: False.
+- `EXTRACT_IMAGES` will extract images and save separately. Default: True.
+- `BAD_SPAN_TYPES` specifies layout blocks to remove from the markdown output.
+
 # Benchmarks
 
 Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.

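The hosted API note in this README hunk quotes a latency of roughly 10s plus 1s per page. As a rough illustration of that arithmetic only, the sketch below turns the figure into an estimate; the function and constant names are hypothetical, and actual service timings may differ.

```python
# Rough estimate of hosted-API conversion time, based only on the
# "10s + 1s/page" figure quoted in the README. All names are illustrative.
FIXED_OVERHEAD_S = 10.0  # approximate fixed cost per conversion
PER_PAGE_S = 1.0         # approximate additional cost per page


def estimate_conversion_seconds(num_pages: int) -> float:
    """Return the approximate conversion time in seconds for one PDF."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return FIXED_OVERHEAD_S + PER_PAGE_S * num_pages


print(estimate_conversion_seconds(50))  # 60.0
```

Under that figure, a 50-page document would take about a minute.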
convert.py
CHANGED
@@ -23,6 +23,9 @@ configure_logging()
 
 
 def worker_init(shared_model):
+    if shared_model is None:
+        shared_model = load_all_models()
+
     global model_refs
     model_refs = shared_model
 
@@ -105,17 +108,22 @@ def main():
     tasks_per_gpu = settings.INFERENCE_RAM // settings.VRAM_PER_TASK if settings.CUDA else 0
     total_processes = min(tasks_per_gpu, total_processes)
 
-
-
+    try:
+        mp.set_start_method('spawn') # Required for CUDA, forkserver doesn't work
+    except RuntimeError:
+        raise RuntimeError("Set start method to spawn twice. This may be a temporary issue with the script. Please try running it again.")
 
-
-
-continue
+    if settings.TORCH_DEVICE == "mps" or settings.TORCH_DEVICE_MODEL == "mps":
+        print("Cannot use MPS with torch multiprocessing share_memory. This will make things less memory efficient. If you want to share memory, you have to use CUDA or CPU. Set the TORCH_DEVICE environment variable to change the device.")
 
-
-
+        model_lst = None
+    else:
+        model_lst = load_all_models()
 
-model
+        for model in model_lst:
+            if model is None:
+                continue
+            model.share_memory()
 
     print(f"Converting {len(files_to_convert)} pdfs in chunk {args.chunk_idx + 1}/{args.num_chunks} with {total_processes} processes, and storing in {out_folder}")
     task_args = [(f, out_folder, metadata.get(os.path.basename(f)), args.min_length) for f in files_to_convert]

marker/images/extract.py
CHANGED
@@ -39,6 +39,11 @@ def extract_page_images(page_obj, page):
     image_blocks = find_image_blocks(page)
 
     for image_idx, (block_idx, line_idx, bbox) in enumerate(image_blocks):
+        if block_idx >= len(page.blocks):
+            block_idx = len(page.blocks) - 1
+        if block_idx < 0:
+            continue
+
         block = page.blocks[block_idx]
         image = render_bbox_image(page_obj, page, bbox)
         image_filename = get_image_filename(page, image_idx)

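The bounds check added here clamps an out-of-range `block_idx` to the last block and skips the image when the page has no blocks at all. A small standalone sketch of the same logic follows; `clamp_block_idx` is a hypothetical helper, not a function in marker.

```python
from typing import Optional


def clamp_block_idx(block_idx: int, num_blocks: int) -> Optional[int]:
    """Clamp an index into range; return None when it should be skipped."""
    if block_idx >= num_blocks:
        block_idx = num_blocks - 1  # fall back to the last block
    if block_idx < 0:
        return None  # empty page (or negative index): nothing to attach to
    return block_idx


print(clamp_block_idx(7, 3))  # 2
print(clamp_block_idx(0, 0))  # None
```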
marker/postprocessors/markdown.py
CHANGED
@@ -4,6 +4,8 @@ import re
 import regex
 from typing import List
 
+from marker.settings import settings
+
 
 def escape_markdown(text):
     # List of characters that need to be escaped in markdown
@@ -143,7 +145,7 @@ def merge_lines(blocks: List[List[MergedBlock]]):
     block_text = ""
     block_type = ""
 
-    for page in blocks:
+    for idx, page in enumerate(blocks):
         for block in page:
             block_type = block.block_type
             if block_type != prev_type and prev_type:
@@ -168,6 +170,9 @@ def merge_lines(blocks: List[List[MergedBlock]]):
                 else:
                     block_text = line.text
 
+        if settings.PAGINATE_OUTPUT and idx < len(blocks) - 1:
+            block_text += "\n\n" + "-" * 16 + "\n\n" # Page separator horizontal rule
+
     # Append the final block
     text_blocks.append(
         FullyMergedBlock(

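The pagination change appends a 16-dash rule after every page except the last. The sketch below shows that rule on plain strings. It is a simplification: `join_pages` is a hypothetical helper working on already-merged page text, whereas the diff appends the separator to `block_text` inside `merge_lines`.

```python
from typing import List

PAGE_SEPARATOR = "\n\n" + "-" * 16 + "\n\n"  # same string as in the diff


def join_pages(pages: List[str], paginate: bool) -> str:
    """Join per-page markdown, adding a rule between pages when requested."""
    out = []
    for idx, page_text in enumerate(pages):
        out.append(page_text)
        if idx < len(pages) - 1:  # never after the final page
            out.append(PAGE_SEPARATOR if paginate else "\n\n")
    return "".join(out)


print(join_pages(["Page one", "Page two"], paginate=True))
```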
marker/settings.py
CHANGED
@@ -11,6 +11,7 @@ class Settings(BaseSettings):
     TORCH_DEVICE: Optional[str] = None # Note: MPS device does not work for text detection, and will default to CPU
     IMAGE_DPI: int = 96 # DPI to render images pulled from pdf at
     EXTRACT_IMAGES: bool = True # Extract images from pdfs and save them
+    PAGINATE_OUTPUT: bool = False # Paginate output markdown
 
     @computed_field
     @property

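Because `Settings` subclasses `BaseSettings`, fields such as `PAGINATE_OUTPUT` can typically be overridden through environment variables of the same name, as the `TORCH_DEVICE` message in `convert.py` suggests. The sketch below imitates that behaviour with the standard library only, so it runs without marker or pydantic installed. `DemoSettings` and `env_bool` are hypothetical stand-ins, and pydantic's real parsing rules are broader than this.

```python
import os
from dataclasses import dataclass, field


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class DemoSettings:
    # Defaults mirror the values shown in the diff.
    EXTRACT_IMAGES: bool = field(
        default_factory=lambda: env_bool("EXTRACT_IMAGES", True))
    PAGINATE_OUTPUT: bool = field(
        default_factory=lambda: env_bool("PAGINATE_OUTPUT", False))


os.environ["PAGINATE_OUTPUT"] = "true"  # e.g. exported in the shell beforehand
settings = DemoSettings()
print(settings.PAGINATE_OUTPUT, settings.EXTRACT_IMAGES)
```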
pyproject.toml
CHANGED
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "0.2.
+version = "0.2.14"
 description = "Convert PDF to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"