Vik Paruchuri committed
Commit a04b2ff · 1 Parent(s): 48dfedd

Clean up benchmarking
README.md CHANGED
@@ -1,6 +1,6 @@
 # Marker
 
-Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes hallucinations. It runs on GPU or CPU, with configurable parallelism.
+Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes the risk of hallucinations significantly.
 
 Features:
 
@@ -9,19 +9,21 @@ Features:
 - Removal of headers/footers/other artifacts
 - Latex conversion for most equations
 - Proper code block and table formatting
-- Support for multiple languages (although most testing is done in English)
-- Works on GPU, CPU, or MPS (mac)
+- Support for multiple languages (although most testing is done in English). See `settings.py` for a list of supported languages.
+- Works on GPU, CPU, or MPS
 
 ## How it works
 
 Marker is a pipeline of steps and deep learning models:
 
-- OCR if text cannot be detected
-- Detect page layout
-- Format blocks properly based on layout
+- Loop through each document page, and:
+  - OCR the page if text cannot be detected
+  - Detect page layout
+  - Format blocks properly based on layout
+- Combine text from all pages
 - Postprocess extracted text
 
-Marker minimizes autoregression, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of the document that are passed through an LLM forward pass are equation regions.
+Marker minimizes the use of autoregressive models, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of a document that are passed through an LLM forward pass are equation blocks.
 
 ## Limitations
 
@@ -46,12 +48,12 @@ This has been tested on Mac and Linux (Ubuntu).
 
 - Install the system requirements at `install/apt-requirements.txt` for Linux or `install/brew-requirements.txt` for Mac
 - Linux only: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/). You may get tesseract 4 otherwise.
-- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for details).
+- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for the commands).
 - Set the tesseract data folder path
   - Find the tesseract data folder `tessdata`
   - On mac, you can run `brew list tesseract`
-  - Or run `find / -name tessdata` to find it
-- Create a `local.env` file with `TESSDATA_PREFIX=/path/to/tessdata` inside it
+  - On linux, run `find / -name tessdata`
+- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
 
 **Python packages**
 
@@ -62,7 +64,8 @@ This has been tested on Mac and Linux (Ubuntu).
 **Configuration**
 
 - Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
-- If using GPU, set `MAX_TASKS_PER_GPU` to your GPU VRAM (per GPU) divided by 1.5 GB. For example, if you have 16 GB of VRAM, set `MAX_TASKS_PER_GPU=10`.
+- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
+- Depending on your document types, marker's average memory usage per task can vary. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
 - Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
 
 ## Convert a single file
@@ -73,7 +76,7 @@ Run `convert_single.py`, like this:
 python convert_single.py /path/to/file.pdf /path/to/output.md --workers 4 --max_pages 10
 ```
 
-- `--workers` is the number of parallel processes to run. This is set to 1 by default, but you can increase it to speed up processing, at the cost of more CPU/GPU usage.
+- `--workers` is the number of parallel CPU processes to run for OCR. This is set to 1 by default, but you can increase it to speed up processing.
 - `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
 
 Make sure the `DEFAULT_LANG` setting is set correctly for this.
@@ -86,14 +89,14 @@ Run `convert.py`, like this:
 python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --metadata_file /path/to/metadata.json
 ```
 
-- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to speed up processing, at the cost of more CPU/GPU usage. This should not be higher than the `MAX_TASKS_PER_GPU` setting.
+- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
 - `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
 - `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:
 
 ```
 {
   "pdf1.pdf": {"language": "English"},
-  "pdf2.pdf": {"language": "English"},
+  "pdf2.pdf": {"language": "Spanish"},
   ...
 }
 ```
@@ -107,15 +110,21 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.s
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
-- `NUM_DEVICES` is the number of GPUs to use. Should be 2 or greater.
-- `NUM_WORKERS` is the number of parallel processes to run on each GPU. This should not be higher than the `MAX_TASKS_PER_GPU` setting.
+- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
+- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
 
 ## Benchmark
 
-You can also benchmark the performance of the pipeline on your machine. Run `benchmark.py`, like this:
+You can benchmark the performance of marker on your machine. Run `benchmark.py`, like this:
 
 ```
 python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
 ```
 
-This will benchmark marker against other text extraction methods. Omit `--nougat` to exclude nougat from the benchmark. (it will take longer)
+This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each (4GB).
+
+Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
+
+# Commercial usage
+
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.
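The GPU parallelism rule in the README above (workers are capped at `INFERENCE_RAM / VRAM_PER_TASK`) can be sketched as a small helper. This is an illustrative sketch, not marker's actual code; the function name and signature are invented here:

```python
def effective_workers(requested: int, inference_ram_gb: float,
                      vram_per_task_gb: float, using_gpu: bool = True) -> int:
    """Cap the requested worker count by how many tasks fit in GPU VRAM."""
    if not using_gpu:
        return requested  # CPU runs are limited only by the --workers flag
    gpu_cap = int(inference_ram_gb // vram_per_task_gb)
    return max(1, min(requested, gpu_cap))
```

With `INFERENCE_RAM=16` and a hypothetical ~4.5 GB per task, requesting 10 workers would run only 3 in parallel.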
benchmark.py CHANGED
@@ -23,11 +23,11 @@ from tabulate import tabulate
 configure_logging()
 
 
-def nougat_prediction(pdf_filename, batch_size=1):
+def nougat_prediction(pdf_filename, batch_size=2):
     out_dir = tempfile.mkdtemp()
     # No skipping avoids failure detection, so we attempt to convert the full doc
-    # Batch size 1 is to compare to single-threaded marker
-    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--batchsize", str(batch_size)], check=True)
+    # Batch size 2 is to match VRAM usage of marker
+    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
     md_file = os.listdir(out_dir)[0]
     with open(os.path.join(out_dir, md_file), "r") as f:
         data = f.read()
@@ -41,8 +41,8 @@ if __name__ == "__main__":
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")
     parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-    parser.add_argument("--nougat_batch_size", type=int, default=4, help="Batch size to use when making predictions")
-    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker processes to run in parallel")
+    parser.add_argument("--nougat_batch_size", type=int, default=settings.NOUGAT_BATCH_SIZE, help="Batch size to use for nougat when making predictions.")
+    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker CPU processes to run in parallel")
     parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
     args = parser.parse_args()
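The batch-size change in this diff is about matching GPU RAM between tools: nougat's batch size is picked so it uses roughly the same VRAM budget as marker (about 4GB, per the README). A rough sketch of that sizing rule; the helper name and the ~2 GB per-image cost are illustrative assumptions, not measured values:

```python
def batch_size_for_vram(vram_budget_gb: float, gb_per_image: float) -> int:
    """Largest batch that fits the VRAM budget, never below 1."""
    return max(1, int(vram_budget_gb // gb_per_image))
```

Under the assumed 2 GB/image cost, a 4 GB budget gives batch size 2, matching the new `batch_size=2` default above.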
 
marker/benchmark/scoring.py CHANGED
@@ -1,8 +1,9 @@
-from rapidfuzz import fuzz
+import math
+
+from rapidfuzz import fuzz, distance
 import re
 
-CHUNK_SIZE = 128
-CHUNK_OVERLAP = 96
+CHUNK_MIN_CHARS = 25
 
 
 def tokenize(text):
@@ -14,14 +15,9 @@ def tokenize(text):
     return flattened_result
 
 
-def chunk_text(text, overlap=False):
-    tokens = tokenize(text)
-    chunks = []
-    step = CHUNK_SIZE
-    if overlap:
-        step -= CHUNK_OVERLAP
-    for i in range(0, len(tokens), step):
-        chunks.append(''.join(tokens[i:i+CHUNK_SIZE]))
+def chunk_text(text):
+    chunks = text.split("\n")
+    chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
     return chunks
 
 
@@ -29,22 +25,27 @@ def overlap_score(hypothesis_chunks, reference_chunks):
     length_modifier = len(hypothesis_chunks) / len(reference_chunks)
     search_distance = max(len(reference_chunks) // 5, 10)
     chunk_scores = []
+    chunk_weights = []
     for i, hyp_chunk in enumerate(hypothesis_chunks):
         max_score = 0
+        chunk_weight = 1
         i_offset = int(i * length_modifier)
         chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
         for j in chunk_range:
             ref_chunk = reference_chunks[j]
-            score = fuzz.ratio(hyp_chunk, ref_chunk)
+            score = fuzz.ratio(hyp_chunk, ref_chunk) / 100
             if score > max_score:
                 max_score = score
+                chunk_weight = math.sqrt(len(ref_chunk))
         chunk_scores.append(max_score)
+        chunk_weights.append(chunk_weight)
-    return chunk_scores
+    chunk_scores = [chunk_scores[i] * chunk_weights[i] for i in range(len(chunk_scores))]
+    return chunk_scores, chunk_weights
 
 
 def score_text(hypothesis, reference):
     # Returns a 0-1 alignment score
     hypothesis_chunks = chunk_text(hypothesis)
     reference_chunks = chunk_text(reference)
-    chunk_scores = overlap_score(hypothesis_chunks, reference_chunks)
-    return sum(chunk_scores) / len(chunk_scores) / 100
+    chunk_scores, chunk_weights = overlap_score(hypothesis_chunks, reference_chunks)
+    return sum(chunk_scores) / sum(chunk_weights)
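The new scoring logic in this diff can be summarized: each hypothesis chunk gets its best fuzzy-match score against reference chunks, weighted by the square root of the matched reference chunk's length, and the final score is the weighted mean. A self-contained sketch of that idea, using the standard library's `SequenceMatcher` as a stand-in for rapidfuzz's `fuzz.ratio` and omitting the search-window optimization:

```python
import math
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    # Stand-in for rapidfuzz's fuzz.ratio, already scaled to 0-1
    return SequenceMatcher(None, a, b).ratio()


def weighted_alignment(hyp_chunks, ref_chunks):
    """Weighted mean of best-match scores; longer reference chunks weigh more."""
    scores, weights = [], []
    for hyp in hyp_chunks:
        best, weight = 0.0, 1.0
        for ref in ref_chunks:
            s = ratio(hyp, ref)
            if s > best:
                best, weight = s, math.sqrt(len(ref))
        scores.append(best * weight)
        weights.append(weight)
    return sum(scores) / sum(weights)
```

Identical hypothesis and reference chunks score 1.0; totally dissimilar ones approach 0.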
marker/cleaners/equations.py CHANGED
@@ -23,7 +23,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
 
 def load_nougat_model():
-    ckpt = get_checkpoint(None, model_tag="0.1.0-small")
+    ckpt = get_checkpoint(None, model_tag=settings.NOUGAT_MODEL_NAME)
     nougat_model = NougatModel.from_pretrained(ckpt)
     if settings.TORCH_DEVICE != "cpu":
         move_to_device(nougat_model, bf16=settings.CUDA, cuda=settings.CUDA)
@@ -94,7 +94,7 @@ def get_nougat_text_batched(images, reformat_region_lens, nougat_model):
     max_length += settings.NOUGAT_TOKEN_BUFFER
 
     nougat_model.config.max_length = max_length
-    model_output = nougat_model.inference(image_tensors=sample)
+    model_output = nougat_model.inference(image_tensors=sample, early_stopping=False)
     for j, output in enumerate(model_output["predictions"]):
         disclaimer = ""
         token_count = get_total_nougat_tokens(output, nougat_model)
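The `max_length` logic in the hunk above exists because shorter generation limits make equation decoding faster: marker estimates the token count of each region and adds `settings.NOUGAT_TOKEN_BUFFER` as headroom. A hedged sketch of that cap; the helper name and the clamp to a model maximum are assumptions here, not marker's code:

```python
def generation_max_length(expected_tokens: int, buffer: int, model_cap: int) -> int:
    """Expected output length plus a safety buffer, clamped to the model's limit."""
    return min(expected_tokens + buffer, model_cap)
```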
marker/ordering.py CHANGED
@@ -12,11 +12,11 @@ import io
 from marker.schema import Page
 from marker.settings import settings
 
-processor = LayoutLMv3Processor.from_pretrained("vikp/column_detector")
+processor = LayoutLMv3Processor.from_pretrained(settings.ORDERER_MODEL_NAME)
 
 
 def load_ordering_model():
-    model = LayoutLMv3ForSequenceClassification.from_pretrained("vikp/column_detector").to(settings.TORCH_DEVICE)
+    model = LayoutLMv3ForSequenceClassification.from_pretrained(settings.ORDERER_MODEL_NAME).to(settings.TORCH_DEVICE)
     if settings.CUDA:
         model = model.to(torch.bfloat16)
     return model
marker/segmentation.py CHANGED
@@ -17,14 +17,14 @@ from math import isclose
 # Otherwise some images can be truncated
 Image.MAX_IMAGE_PIXELS = None
 
-processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
+processor = LayoutLMv3Processor.from_pretrained(settings.LAYOUT_MODEL_NAME, apply_ocr=False)
 
 CHUNK_KEYS = ["input_ids", "attention_mask", "bbox", "offset_mapping"]
 NO_CHUNK_KEYS = ["pixel_values"]
 
 
 def load_layout_model():
-    model = LayoutLMv3ForTokenClassification.from_pretrained("Kwan0/layoutlmv3-base-finetune-DocLayNet-100k").to(settings.TORCH_DEVICE)
+    model = LayoutLMv3ForTokenClassification.from_pretrained(settings.LAYOUT_MODEL_NAME).to(settings.TORCH_DEVICE)
     if settings.CUDA:
         model = model.to(torch.bfloat16)
 
marker/settings.py CHANGED
@@ -53,17 +53,19 @@ class Settings(BaseSettings):
     NOUGAT_HALLUCINATION_WORDS: List[str] = ["[MISSING_PAGE_POST]", "## References\n", "**Figure Captions**\n", "Footnote",
                                             "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]"]
     NOUGAT_DPI: int = 96  # DPI to render images at, matches default settings for nougat
-    NOUGAT_MODEL_NAME: str = "facebook/nougat-small"  # Name of the model to use
-    NOUGAT_BATCH_SIZE: int = 4
+    NOUGAT_MODEL_NAME: str = "0.1.0-small"  # Name of the model to use
+    NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1  # Batch size for nougat, don't batch on cpu
 
     # Layout Model
     BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
     LAYOUT_MODEL_MAX: int = 512
     LAYOUT_CHUNK_OVERLAP: int = 64
     LAYOUT_DPI: int = 96
+    LAYOUT_MODEL_NAME: str = "vikp/layout_segmenter"
 
     # Ordering model
     ORDERER_BATCH_SIZE: int = 16  # This can be high, because max token count is 128
+    ORDERER_MODEL_NAME: str = "vikp/column_detector"
 
     # Ray
     RAY_CACHE_PATH: Optional[str] = None  # Where to save ray cache
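The settings above resolve through pydantic's `BaseSettings`, so a `local.env` entry or environment variable overrides the in-code default. A dependency-free sketch of just that lookup order (`resolve_setting` is an invented name, not marker's API):

```python
import os


def resolve_setting(name: str, default: int) -> int:
    """Environment variable wins over the in-code default, as with BaseSettings."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


os.environ["NOUGAT_BATCH_SIZE"] = "2"  # what a local.env override would do
batch = resolve_setting("NOUGAT_BATCH_SIZE", 4)      # override applied
orderer = resolve_setting("ORDERER_BATCH_SIZE", 16)  # default kept
```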