Vik Paruchuri committed
Commit a04b2ff · 1 Parent(s): 48dfedd

Clean up benchmarking
README.md CHANGED
@@ -1,6 +1,6 @@
 # Marker
 
-Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes hallucinations. It runs on GPU or CPU, with configurable parallelism.
+Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes the risk of hallucinations significantly.
 
 Features:
 
@@ -9,19 +9,21 @@ Features:
 - Removal of headers/footers/other artifacts
 - Latex conversion for most equations
 - Proper code block and table formatting
-- Support for multiple languages (although most testing is done in English)
-- Works on GPU, CPU, or MPS (mac)
+- Support for multiple languages (although most testing is done in English). See `settings.py` for a list of supported languages.
+- Works on GPU, CPU, or MPS
 
 ## How it works
 
 Marker is a pipeline of steps and deep learning models:
 
-- OCR if text cannot be detected
-- Detect page layout
-- Format blocks properly based on layout
+- Loop through each document page, and:
+  - OCR the page if text cannot be detected
+  - Detect page layout
+  - Format blocks properly based on layout
+- Combine text from all pages
 - Postprocess extracted text
 
-Marker minimizes autoregression, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of the document that are passed through an LLM forward pass are equation regions.
+Marker minimizes the use of autoregressive models, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of a document that are passed through an LLM forward pass are equation blocks.
 
 ## Limitations
 
@@ -46,12 +48,12 @@ This has been tested on Mac and Linux (Ubuntu).
 
 - Install the system requirements at `install/apt-requirements.txt` for Linux or `install/brew-requirements.txt` for Mac
 - Linux only: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/). You may get tesseract 4 otherwise.
-- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for details).
+- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for the commands).
 - Set the tesseract data folder path
   - Find the tesseract data folder `tessdata`
   - On mac, you can run `brew list tesseract`
-  - Or run `find / -name tessdata` to find it
-- Create a `local.env` file with `TESSDATA_PREFIX=/path/to/tessdata` inside it
+  - On linux, run `find / -name tessdata`
+- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 
 **Python packages**
 
@@ -62,7 +64,8 @@ This has been tested on Mac and Linux (Ubuntu).
 **Configuration**
 
 - Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
-- If using GPU, set `MAX_TASKS_PER_GPU` to your GPU VRAM (per GPU) divided by 1.5 GB. For example, if you have 16 GB of VRAM, set `MAX_TASKS_PER_GPU=10`.
+- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
+- Depending on your document types, marker's average memory usage per task can vary. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
 - Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
 
 ## Convert a single file
@@ -73,7 +76,7 @@ Run `convert_single.py`, like this:
 python convert_single.py /path/to/file.pdf /path/to/output.md --workers 4 --max_pages 10
 ```
 
-- `--workers` is the number of parallel processes to run. This is set to 1 by default, but you can increase it to speed up processing, at the cost of more CPU/GPU usage.
+- `--workers` is the number of parallel CPU processes to run for OCR. This is set to 1 by default, but you can increase it to speed up processing.
 - `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
 
 Make sure the `DEFAULT_LANG` setting is set correctly for this.
@@ -86,14 +89,14 @@ Run `convert.py`, like this:
 python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --metadata_file /path/to/metadata.json
 ```
 
-- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to speed up processing, at the cost of more CPU/GPU usage. This should not be higher than the `MAX_TASKS_PER_GPU` setting.
+- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
 - `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
 - `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:
 
 ```
 {
   "pdf1.pdf": {"language": "English"},
-  "pdf2.pdf": {"language": "English"},
+  "pdf2.pdf": {"language": "Spanish"},
   ...
 }
 ```
@@ -107,15 +110,21 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
-- `NUM_DEVICES` is the number of GPUs to use. Should be 2 or greater.
-- `NUM_WORKERS` is the number of parallel processes to run on each GPU. This should not be higher than the `MAX_TASKS_PER_GPU` setting.
+- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
+- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
 ## Benchmark
 
-You can also benchmark the performance of the pipeline on your machine. Run `benchmark.py`, like this:
+You can benchmark the performance of marker on your machine. Run `benchmark.py`, like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
 ```
 
-This will benchmark marker against other text extraction methods. Omit `--nougat` to exclude nougat from the benchmark. (it will take longer)
+This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each (4GB).
+
+Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
+
+# Commercial usage
+
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
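The GPU parallelism rule in the README above (workers are capped at `INFERENCE_RAM / VRAM_PER_TASK`) can be sketched as a small helper. This is an illustrative sketch, not marker's actual code; the function name and signature are invented here:

```python
def effective_workers(requested: int, inference_ram_gb: float,
                      vram_per_task_gb: float, using_gpu: bool = True) -> int:
    """Cap the requested worker count by how many tasks fit in GPU VRAM."""
    if not using_gpu:
        return requested  # CPU runs are limited only by the --workers flag
    gpu_cap = int(inference_ram_gb // vram_per_task_gb)
    return max(1, min(requested, gpu_cap))
```

With `INFERENCE_RAM=16` and a hypothetical ~4.5 GB per task, requesting 10 workers would run only 3 in parallel.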
benchmark.py CHANGED
@@ -23,11 +23,11 @@ from tabulate import tabulate
 configure_logging()
 
 
-def nougat_prediction(pdf_filename, batch_size=1):
+def nougat_prediction(pdf_filename, batch_size=2):
     out_dir = tempfile.mkdtemp()
     # No skipping avoids failure detection, so we attempt to convert the full doc
-    # Batch size 1 is to compare to single-threaded marker
-    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--batchsize", str(batch_size)], check=True)
+    # Batch size 2 is to match VRAM usage of marker
+    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
     md_file = os.listdir(out_dir)[0]
     with open(os.path.join(out_dir, md_file), "r") as f:
         data = f.read()
@@ -41,8 +41,8 @@ if __name__ == "__main__":
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-    parser.add_argument("--nougat_batch_size", type=int, default=4, help="Batch size to use when making predictions")
-    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker processes to run in parallel")
+    parser.add_argument("--nougat_batch_size", type=int, default=settings.NOUGAT_BATCH_SIZE, help="Batch size to use for nougat when making predictions.")
+    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker CPU processes to run in parallel")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
     args = parser.parse_args()
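The batch-size change in this diff is about matching GPU RAM between tools: nougat's batch size is picked so it uses roughly the same VRAM budget as marker (about 4GB, per the README). A rough sketch of that sizing rule; the helper name and the ~2 GB per-image cost are illustrative assumptions, not measured values:

```python
def batch_size_for_vram(vram_budget_gb: float, gb_per_image: float) -> int:
    """Largest batch that fits the VRAM budget, never below 1."""
    return max(1, int(vram_budget_gb // gb_per_image))
```

Under the assumed 2 GB/image cost, a 4 GB budget gives batch size 2, matching the new `batch_size=2` default above.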
 
marker/benchmark/scoring.py CHANGED
@@ -1,8 +1,9 @@
-from rapidfuzz import fuzz
+import math
+
+from rapidfuzz import fuzz, distance
 import re
 
-CHUNK_SIZE = 128
-CHUNK_OVERLAP = 96
+CHUNK_MIN_CHARS = 25
 
 
 def tokenize(text):
@@ -14,14 +15,9 @@ def tokenize(text):
     return flattened_result
 
 
-def chunk_text(text, overlap=False):
-    tokens = tokenize(text)
-    chunks = []
-    step = CHUNK_SIZE
-    if overlap:
-        step -= CHUNK_OVERLAP
-    for i in range(0, len(tokens), step):
-        chunks.append(''.join(tokens[i:i+CHUNK_SIZE]))
+def chunk_text(text):
+    chunks = text.split("\n")
+    chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
     return chunks
 
 
@@ -29,22 +25,27 @@ def overlap_score(hypothesis_chunks, reference_chunks):
     length_modifier = len(hypothesis_chunks) / len(reference_chunks)
     search_distance = max(len(reference_chunks) // 5, 10)
     chunk_scores = []
+    chunk_weights = []
     for i, hyp_chunk in enumerate(hypothesis_chunks):
         max_score = 0
+        chunk_weight = 1
         i_offset = int(i * length_modifier)
         chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
         for j in chunk_range:
             ref_chunk = reference_chunks[j]
-            score = fuzz.ratio(hyp_chunk, ref_chunk)
+            score = fuzz.ratio(hyp_chunk, ref_chunk) / 100
             if score > max_score:
                 max_score = score
+                chunk_weight = math.sqrt(len(ref_chunk))
         chunk_scores.append(max_score)
+        chunk_weights.append(chunk_weight)
-    return chunk_scores
+    chunk_scores = [chunk_scores[i] * chunk_weights[i] for i in range(len(chunk_scores))]
+    return chunk_scores, chunk_weights
 
 
 def score_text(hypothesis, reference):
     # Returns a 0-1 alignment score
     hypothesis_chunks = chunk_text(hypothesis)
     reference_chunks = chunk_text(reference)
-    chunk_scores = overlap_score(hypothesis_chunks, reference_chunks)
-    return sum(chunk_scores) / len(chunk_scores) / 100
+    chunk_scores, chunk_weights = overlap_score(hypothesis_chunks, reference_chunks)
+    return sum(chunk_scores) / sum(chunk_weights)
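The new scoring logic in this diff can be summarized: each hypothesis chunk gets its best fuzzy-match score against reference chunks, weighted by the square root of the matched reference chunk's length, and the final score is the weighted mean. A self-contained sketch of that idea, using the standard library's `SequenceMatcher` as a stand-in for rapidfuzz's `fuzz.ratio` and omitting the search-window optimization:

```python
import math
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    # Stand-in for rapidfuzz's fuzz.ratio, already scaled to 0-1
    return SequenceMatcher(None, a, b).ratio()


def weighted_alignment(hyp_chunks, ref_chunks):
    """Weighted mean of best-match scores; longer reference chunks weigh more."""
    scores, weights = [], []
    for hyp in hyp_chunks:
        best, weight = 0.0, 1.0
        for ref in ref_chunks:
            s = ratio(hyp, ref)
            if s > best:
                best, weight = s, math.sqrt(len(ref))
        scores.append(best * weight)
        weights.append(weight)
    return sum(scores) / sum(weights)
```

Identical hypothesis and reference chunks score 1.0; totally dissimilar ones approach 0.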
marker/cleaners/equations.py CHANGED
@@ -23,7 +23,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
 
 def load_nougat_model():
-    ckpt = get_checkpoint(None, model_tag="0.1.0-small")
+    ckpt = get_checkpoint(None, model_tag=settings.NOUGAT_MODEL_NAME)
     nougat_model = NougatModel.from_pretrained(ckpt)
     if settings.TORCH_DEVICE != "cpu":
         move_to_device(nougat_model, bf16=settings.CUDA, cuda=settings.CUDA)
@@ -94,7 +94,7 @@ def get_nougat_text_batched(images, reformat_region_lens, nougat_model):
     max_length += settings.NOUGAT_TOKEN_BUFFER
 
     nougat_model.config.max_length = max_length
-    model_output = nougat_model.inference(image_tensors=sample)
+    model_output = nougat_model.inference(image_tensors=sample, early_stopping=False)
     for j, output in enumerate(model_output["predictions"]):
         disclaimer = ""
         token_count = get_total_nougat_tokens(output, nougat_model)
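The `max_length` logic in the hunk above exists because shorter generation limits make equation decoding faster: marker estimates the token count of each region and adds `settings.NOUGAT_TOKEN_BUFFER` as headroom. A hedged sketch of that cap; the helper name and the clamp to a model maximum are assumptions here, not marker's code:

```python
def generation_max_length(expected_tokens: int, buffer: int, model_cap: int) -> int:
    """Expected output length plus a safety buffer, clamped to the model's limit."""
    return min(expected_tokens + buffer, model_cap)
```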
marker/ordering.py CHANGED
@@ -12,11 +12,11 @@ import io
 from marker.schema import Page
 from marker.settings import settings
 
-processor = LayoutLMv3Processor.from_pretrained("vikp/column_detector")
+processor = LayoutLMv3Processor.from_pretrained(settings.ORDERER_MODEL_NAME)
 
 
 def load_ordering_model():
-    model = LayoutLMv3ForSequenceClassification.from_pretrained("vikp/column_detector").to(settings.TORCH_DEVICE)
+    model = LayoutLMv3ForSequenceClassification.from_pretrained(settings.ORDERER_MODEL_NAME).to(settings.TORCH_DEVICE)
     if settings.CUDA:
         model = model.to(torch.bfloat16)
     return model
marker/segmentation.py CHANGED
@@ -17,14 +17,14 @@ from math import isclose
 # Otherwise some images can be truncated
 Image.MAX_IMAGE_PIXELS = None
 
-processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
+processor = LayoutLMv3Processor.from_pretrained(settings.LAYOUT_MODEL_NAME, apply_ocr=False)
 
 CHUNK_KEYS = ["input_ids", "attention_mask", "bbox", "offset_mapping"]
 NO_CHUNK_KEYS = ["pixel_values"]
 
 
 def load_layout_model():
-    model = LayoutLMv3ForTokenClassification.from_pretrained("Kwan0/layoutlmv3-base-finetune-DocLayNet-100k").to(settings.TORCH_DEVICE)
+    model = LayoutLMv3ForTokenClassification.from_pretrained(settings.LAYOUT_MODEL_NAME).to(settings.TORCH_DEVICE)
     if settings.CUDA:
         model = model.to(torch.bfloat16)
 
marker/settings.py CHANGED
@@ -53,17 +53,19 @@ class Settings(BaseSettings):
     NOUGAT_HALLUCINATION_WORDS: List[str] = ["[MISSING_PAGE_POST]", "## References\n", "**Figure Captions**\n", "Footnote",
                                             "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]"]
     NOUGAT_DPI: int = 96  # DPI to render images at, matches default settings for nougat
-    NOUGAT_MODEL_NAME: str = "facebook/nougat-small"  # Name of the model to use
-    NOUGAT_BATCH_SIZE: int = 4
+    NOUGAT_MODEL_NAME: str = "0.1.0-small"  # Name of the model to use
+    NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1  # Batch size for nougat, don't batch on cpu
 
     # Layout Model
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
     LAYOUT_MODEL_MAX: int = 512
     LAYOUT_CHUNK_OVERLAP: int = 64
     LAYOUT_DPI: int = 96
+    LAYOUT_MODEL_NAME: str = "vikp/layout_segmenter"
 
     # Ordering model
     ORDERER_BATCH_SIZE: int = 16  # This can be high, because max token count is 128
+    ORDERER_MODEL_NAME: str = "vikp/column_detector"
 
     # Ray
     RAY_CACHE_PATH: Optional[str] = None  # Where to save ray cache
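The settings above resolve through pydantic's `BaseSettings`, so a `local.env` entry or environment variable overrides the in-code default. A dependency-free sketch of just that lookup order (`resolve_setting` is an invented name, not marker's API):

```python
import os


def resolve_setting(name: str, default: int) -> int:
    """Environment variable wins over the in-code default, as with BaseSettings."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


os.environ["NOUGAT_BATCH_SIZE"] = "2"  # what a local.env override would do
batch = resolve_setting("NOUGAT_BATCH_SIZE", 4)      # override applied
orderer = resolve_setting("ORDERER_BATCH_SIZE", 16)  # default kept
```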