Vik Paruchuri committed
Commit f9e3dde · 1 Parent(s): c7d0c93

Bugfixes and new features
README.md CHANGED

```diff
@@ -38,16 +38,16 @@ The above results are with marker and nougat setup so they each take ~4GB of VRA
 
 See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
+# Hosted API
+
+There is a hosted API for marker available [here](https://www.datalab.to/). It has been tuned for performance, and generally takes 10s + 1s/page for conversion.
+
 # Commercial usage
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
 
 The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
 
-# Hosted API
-
-There is a hosted API for marker available [here](https://www.datalab.to/). It's currently in beta, and I'm working on optimizing speed.
-
 # Community
 
 [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
@@ -147,6 +147,15 @@ There are some settings that you may find useful if things aren't working the wa
 
 In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.
 
+## Useful settings
+
+These settings can improve/change output quality:
+
+- `OCR_ALL_PAGES` will force OCR across the document. Many PDFs have bad text embedded due to older OCR engines being used.
+- `PAGINATE_OUTPUT` will put a horizontal rule between pages. Default: False.
+- `EXTRACT_IMAGES` will extract images and save separately. Default: True.
+- `BAD_SPAN_TYPES` specifies layout blocks to remove from the markdown output.
+
 # Benchmarks
 
 Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods. It's noisy, but at least directionally correct.
```
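Settings like the ones added to the README are typically supplied as environment variables before running the converter. The variable names below come from the diff; setting them this way is an illustrative sketch, not a documented invocation:

```python
import os

# Hypothetical sketch: override marker settings via the environment.
# Variable names are taken from the "Useful settings" list above.
os.environ["OCR_ALL_PAGES"] = "true"     # force OCR on every page
os.environ["PAGINATE_OUTPUT"] = "true"   # horizontal rule between pages
os.environ["EXTRACT_IMAGES"] = "false"   # skip saving extracted images
```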
convert.py CHANGED

```diff
@@ -23,6 +23,9 @@ configure_logging()
 
 
 def worker_init(shared_model):
+    if shared_model is None:
+        shared_model = load_all_models()
+
     global model_refs
     model_refs = shared_model
 
@@ -105,17 +108,22 @@ def main():
     tasks_per_gpu = settings.INFERENCE_RAM // settings.VRAM_PER_TASK if settings.CUDA else 0
     total_processes = min(tasks_per_gpu, total_processes)
 
-    mp.set_start_method('spawn') # Required for CUDA, forkserver doesn't work
-    model_lst = load_all_models()
+    try:
+        mp.set_start_method('spawn') # Required for CUDA, forkserver doesn't work
+    except RuntimeError:
+        raise RuntimeError("Set start method to spawn twice. This may be a temporary issue with the script. Please try running it again.")
 
-    for model in model_lst:
-        if model is None:
-            continue
+    if settings.TORCH_DEVICE == "mps" or settings.TORCH_DEVICE_MODEL == "mps":
+        print("Cannot use MPS with torch multiprocessing share_memory. This will make things less memory efficient. If you want to share memory, you have to use CUDA or CPU. Set the TORCH_DEVICE environment variable to change the device.")
 
-        if model.device.type == "mps":
-            raise ValueError("Cannot use MPS with torch multiprocessing share_memory. You have to use CUDA or CPU. Set the TORCH_DEVICE environment variable to change the device.")
+        model_lst = None
+    else:
+        model_lst = load_all_models()
 
-        model.share_memory()
+        for model in model_lst:
+            if model is None:
+                continue
+            model.share_memory()
 
     print(f"Converting {len(files_to_convert)} pdfs in chunk {args.chunk_idx + 1}/{args.num_chunks} with {total_processes} processes, and storing in {out_folder}")
     task_args = [(f, out_folder, metadata.get(os.path.basename(f)), args.min_length) for f in files_to_convert]
```
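The try/except around `set_start_method` guards against the start method having already been set (calling it twice raises `RuntimeError`). The commit re-raises with a friendlier message; the standalone sketch below (a hypothetical helper, not marker's API) shows the same guard, tolerating the repeat call instead:

```python
import multiprocessing as mp

def set_spawn_once():
    # "spawn" is required for sharing CUDA tensors across worker processes;
    # set_start_method raises RuntimeError if a start method is already set.
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        pass  # already set earlier in this process; nothing to do
```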
marker/images/extract.py CHANGED

```diff
@@ -39,6 +39,11 @@ def extract_page_images(page_obj, page):
     image_blocks = find_image_blocks(page)
 
     for image_idx, (block_idx, line_idx, bbox) in enumerate(image_blocks):
+        if block_idx >= len(page.blocks):
+            block_idx = len(page.blocks) - 1
+        if block_idx < 0:
+            continue
+
         block = page.blocks[block_idx]
         image = render_bbox_image(page_obj, page, bbox)
         image_filename = get_image_filename(page, image_idx)
```
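The bounds check added here clamps an out-of-range block index to the last block on the page and skips the image entirely when the page has no blocks. Reduced to a hypothetical standalone helper (not part of marker):

```python
def clamp_block_idx(block_idx: int, n_blocks: int):
    """Clamp an overflowing block index to the last block; return None
    when there are no blocks to attach the image to."""
    if block_idx >= n_blocks:
        block_idx = n_blocks - 1
    if block_idx < 0:
        return None
    return block_idx
```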
marker/postprocessors/markdown.py CHANGED

```diff
@@ -4,6 +4,8 @@ import re
 import regex
 from typing import List
 
+from marker.settings import settings
+
 
 def escape_markdown(text):
     # List of characters that need to be escaped in markdown
@@ -143,7 +145,7 @@ def merge_lines(blocks: List[List[MergedBlock]]):
     block_text = ""
     block_type = ""
 
-    for page in blocks:
+    for idx, page in enumerate(blocks):
         for block in page:
             block_type = block.block_type
             if block_type != prev_type and prev_type:
@@ -168,6 +170,9 @@ def merge_lines(blocks: List[List[MergedBlock]]):
                 else:
                     block_text = line.text
 
+        if settings.PAGINATE_OUTPUT and idx < len(blocks) - 1:
+            block_text += "\n\n" + "-" * 16 + "\n\n" # Page separator horizontal rule
+
     # Append the final block
     text_blocks.append(
         FullyMergedBlock(
```
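The new pagination branch appends a 16-dash horizontal rule after every page except the last. A hypothetical reduction of that logic, with the flag passed as a parameter instead of read from `marker.settings`:

```python
def join_pages(pages, paginate_output=True):
    # Mirrors the new branch: a 16-dash horizontal rule between pages,
    # but none after the final page.
    separator = "\n\n" + "-" * 16 + "\n\n" if paginate_output else ""
    return separator.join(pages)
```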
marker/settings.py CHANGED

```diff
@@ -11,6 +11,7 @@ class Settings(BaseSettings):
     TORCH_DEVICE: Optional[str] = None # Note: MPS device does not work for text detection, and will default to CPU
     IMAGE_DPI: int = 96 # DPI to render images pulled from pdf at
     EXTRACT_IMAGES: bool = True # Extract images from pdfs and save them
+    PAGINATE_OUTPUT: bool = False # Paginate output markdown
 
     @computed_field
     @property
```
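Because `PAGINATE_OUTPUT` is a field on a pydantic `BaseSettings` class, it can be overridden from the environment. A dependency-free sketch of how such a boolean flag resolves (the coercion rules here are illustrative, not pydantic's exact ones):

```python
import os

def read_bool_setting(name: str, default: bool = False) -> bool:
    # Fall back to the field default when the variable is unset;
    # otherwise treat common truthy strings as True.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")
```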
pyproject.toml CHANGED

```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "marker-pdf"
-version = "0.2.13"
+version = "0.2.14"
 description = "Convert PDF to markdown with high speed and accuracy."
 authors = ["Vik Paruchuri <github@vikas.sh>"]
 readme = "README.md"
```