Commit e704210 by Vik Paruchuri (parent: ec69c20)

Add benchmark results

Files changed:
- README.md +39 -26
- convert.py +0 -4
- marker/cleaners/table.py +1 -1
- marker/ordering.py +5 -3
- marker/postprocessors/editor.py +8 -16
- marker/segmentation.py +4 -3
- marker/settings.py +10 -4
README.md
CHANGED

````diff
@@ -1,12 +1,12 @@
 # Marker
 
-Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has
+Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
 
 - Support for a range of PDF documents (optimized for books and scientific papers)
 - Removes headers/footers/other artifacts
 - Converts most equations to latex
 - Formats code blocks and tables
-- Support for multiple languages (although most testing is done in English). See `settings.py` for a list
+- Support for multiple languages (although most testing is done in English). See `settings.py` for a language list.
 - Works on GPU, CPU, or MPS
 
 ## How it works
@@ -16,23 +16,29 @@ Marker is a pipeline of deep learning models:
 - Extract text, OCR if necessary (heuristics, tesseract)
 - Detect page layout ([layout segmenter](https://huggingface.co/vikp/layout_segmenter), [column detector](https://huggingface.co/vikp/column_detector))
 - Clean and format each block (heuristics, [nougat](https://huggingface.co/facebook/nougat-base))
-- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/
+- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor_t5))
 
-Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.
+Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper: `We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.` In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.
 
-Marker is 10x faster and
+Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.
 
 ## Examples
 
+| PDF | Type | Marker | Nougat |
+|-----|------|--------|--------|
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkpython.md) |
+| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkos.md) |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |
 
 
+## Performance
+
+![Benchmark overall]()
+
+The above results are with marker and nougat setup so they each take ~3GB of VRAM on an A6000.
+
+See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
 
 # Installation
 
@@ -71,11 +77,12 @@ First, clone the repo:
 
 # Usage
 
+First, some configuration:
 
 - Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
 - If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
 - Depending on your document types, marker's average memory usage per task can vary slightly. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
+- By default, the final editor model is off. Turn it on with `ENABLE_EDITOR_MODEL`.
 - Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
 
 ## Convert a single file
@@ -96,7 +103,7 @@ Make sure the `DEFAULT_LANG` setting is set appropriately for your document.
 Run `convert.py`, like this:
 
 ```
-python convert.py /path/to/input/folder /path/to/output/folder --workers
+python convert.py /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json
 ```
 
 - `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
@@ -116,7 +123,7 @@ python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max
 Run `chunk_convert.sh`, like this:
 
 ```
-METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=
+METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -127,28 +134,34 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.sh ../pdf_in ../md_out
 
 Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.
 
-Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data).
+Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
 
 **Speed**
 
+| Method | Average Score | Time per page | Time per document |
+|--------|---------------|---------------|-------------------|
+| naive  | 0.350727      | 0.00152378    | 0.326524          |
+| marker | 0.641062      | 0.360622      | 77.2762           |
+| nougat | 0.629211      | 3.77259       | 808.413           |
 
 **Accuracy**
 
 First 3 are non-arXiv books, last 3 are arXiv papers.
 
+| Method | switch_trans.pdf | crowd.pdf | multicolcnn.pdf | thinkos.pdf | thinkdsp.pdf | thinkpython.pdf |
+|--------|------------------|-----------|-----------------|-------------|--------------|-----------------|
+| naive  | 0.244114         | 0.140669  | 0.0868221       | 0.366856    | 0.412521     | 0.468281        |
+| marker | 0.482091         | 0.466882  | 0.537062        | 0.754347    | 0.78825      | 0.779536        |
+| nougat | 0.696458         | 0.552337  | 0.735099        | 0.655002    | 0.645704     | 0.650282        |
 
 Peak GPU memory usage during the benchmark is `3.3GB` for nougat, and `3.1GB` for marker. Benchmarks were run on an A6000.
 
+**Throughput**
+
+Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
+
+![Benchmark throughput]()
+
 ## Running your own benchmarks
 
 You can benchmark the performance of marker on your machine. First, download the benchmark data [here](https://drive.google.com/file/d/1WiN4K2-jQfwyQMe4wSSurbpz3hxo2fG9/view?usp=drive_link) and unzip.
````
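The README above caps GPU parallelism at `INFERENCE_RAM / VRAM_PER_TASK`. A minimal sketch of that rule (a hypothetical helper for illustration, not part of the repo; the function and argument names are mine):

```python
def effective_workers(requested_workers: int, inference_ram_gb: int,
                      vram_per_task_gb: float, cuda: bool) -> int:
    """Sketch of the worker cap described in the README: on GPU, parallelism
    cannot exceed INFERENCE_RAM / VRAM_PER_TASK; on CPU/MPS it is uncapped."""
    if not cuda:
        return requested_workers
    gpu_cap = int(inference_ram_gb / vram_per_task_gb)
    return max(1, min(requested_workers, gpu_cap))
```

With the throughput numbers above (an A6000's 48 GB of VRAM at ~2 GB per task), this gives the 24 parallel documents the README mentions.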
convert.py
CHANGED

```diff
@@ -3,15 +3,11 @@ import os
 from typing import Dict
 
 import ray
-import torch
 from tqdm import tqdm
 import math
 
 from marker.convert import convert_single_pdf, get_length_of_text
 from marker.models import load_all_models
-from marker.ordering import load_ordering_model
-from marker.segmentation import load_layout_model
-from marker.cleaners.equations import load_nougat_model
 from marker.settings import settings
 from marker.logger import configure_logging
 import traceback
```
marker/cleaners/table.py
CHANGED

```diff
@@ -80,7 +80,7 @@ def create_new_tables(blocks: List[Page]):
         if max([len("".join(r)) for r in table_rows]) > 300 or len(table_rows[0]) > 8:
             continue
 
-        new_text = tabulate(table_rows, headers="firstrow", tablefmt="
+        new_text = tabulate(table_rows, headers="firstrow", tablefmt="github")
         new_span = Span(
             bbox=block.bbox,
             span_id=f"{table_idx}_fix_table",
```
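The fix above switches `tabulate` to `tablefmt="github"`, which emits markdown pipe tables. For illustration, a rough pure-Python approximation of the shape of that output (a sketch, not tabulate's actual implementation; alignment and padding details differ):

```python
def github_table(rows):
    """Approximate tabulate(rows, headers="firstrow", tablefmt="github"):
    first row becomes the header, followed by a |---| separator line."""
    header, *body = rows
    widths = [max(len(str(r[i])) for r in rows) for i in range(len(header))]

    def fmt(row):
        return "| " + " | ".join(str(c).ljust(w) for c, w in zip(row, widths)) + " |"

    sep = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(header), sep] + [fmt(r) for r in body])
```

The github format is what lets the rebuilt tables render directly in the markdown output.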
marker/ordering.py
CHANGED

```diff
@@ -16,9 +16,11 @@ processor = LayoutLMv3Processor.from_pretrained(settings.ORDERER_MODEL_NAME)
 
 
 def load_ordering_model():
-    model = LayoutLMv3ForSequenceClassification.from_pretrained(
+    model = LayoutLMv3ForSequenceClassification.from_pretrained(
+        settings.ORDERER_MODEL_NAME,
+        torch_dtype=settings.MODEL_DTYPE,
+    ).to(settings.TORCH_DEVICE)
+    model.eval()
     return model
```
marker/postprocessors/editor.py
CHANGED

```diff
@@ -1,7 +1,6 @@
 from collections import defaultdict, Counter
 from itertools import chain
 from typing import Optional
-import re
 
 from transformers import AutoTokenizer
 from marker.settings import settings
@@ -17,11 +16,10 @@ def load_editing_model():
         return None
 
     model = T5ForTokenClassification.from_pretrained(
-        settings.EDITOR_MODEL_NAME
+        settings.EDITOR_MODEL_NAME,
+        torch_dtype=settings.MODEL_DTYPE,
     ).to(settings.TORCH_DEVICE)
-
-    if settings.CUDA:
-        model = model.to(torch.bfloat16)
+    model.eval()
 
     model.config.label2id = {
         "equal": 0,
@@ -41,15 +39,6 @@ def edit_full_text(text: str, model: Optional[T5ForTokenClassification], batch_s
     input_ids = tokenized["input_ids"]
     char_token_lengths = tokenized["char_token_lengths"]
 
-    # Tokenize, and make sure reverse tokenization works
-    model_tokens = [tokenizer.convert_ids_to_tokens(t, skip_special_tokens=True) for t in input_ids]
-    model_tokens_str = [tokenizer.convert_tokens_to_string(t) for t in model_tokens]
-    full_text = "".join(model_tokens_str)
-    assert full_text == text
-
-    # List of characters in the text
-    flat_input_ids = list(chain.from_iterable(input_ids))
-
     # Run model
     token_masks = []
     for i in range(0, len(input_ids), batch_size):
@@ -67,14 +56,17 @@ def edit_full_text(text: str, model: Optional[T5ForTokenClassification], batch_s
         probs = F.softmax(logits, dim=-1)
         max_prob = torch.max(probs, dim=-1)
         cutoff_prob = max_prob.values < settings.EDITOR_CUTOFF_THRESH
         labels = logits.argmax(-1)
         labels[cutoff_prob] = model.config.label2id["equal"]
-        labels = labels.tolist()
+        labels = labels.squeeze().tolist()
         if len(labels) == settings.EDITOR_MAX_LENGTH:
             labels = [labels]
         labels = list(chain.from_iterable(labels))
         token_masks.extend(labels)
 
+    # List of characters in the text
+    flat_input_ids = list(chain.from_iterable(input_ids))
+
     # Strip special tokens 0,1. Keep unknown token, although it should never be used
     assert len(token_masks) == len(flat_input_ids)
     token_masks = [mask for mask, token in zip(token_masks, flat_input_ids) if token >= 2]
```
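The reordered logic above flattens the batched input ids once, after the model loop, and then drops the mask entries aligned with special tokens. A standalone sketch of that final step (hypothetical function name; the body mirrors the last two lines of `edit_full_text`):

```python
from itertools import chain


def strip_special_tokens(token_masks, input_ids):
    """Flatten the per-batch input id lists, then keep only mask entries whose
    aligned token id is >= 2 (ids 0 and 1 are pad/eos in the T5 vocab; the
    unknown token at id 2 is kept, although it should never be used)."""
    flat_input_ids = list(chain.from_iterable(input_ids))
    assert len(token_masks) == len(flat_input_ids)
    return [mask for mask, token in zip(token_masks, flat_input_ids) if token >= 2]
```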
marker/segmentation.py
CHANGED

```diff
@@ -24,9 +24,10 @@ NO_CHUNK_KEYS = ["pixel_values"]
 
 
 def load_layout_model():
-    model = LayoutLMv3ForTokenClassification.from_pretrained(
+    model = LayoutLMv3ForTokenClassification.from_pretrained(
+        settings.LAYOUT_MODEL_NAME,
+        torch_dtype=settings.MODEL_DTYPE,
+    ).to(settings.TORCH_DEVICE)
 
     model.config.id2label = {
         0: "Caption",
```
marker/settings.py
CHANGED

```diff
@@ -5,13 +5,14 @@ from dotenv import find_dotenv
 from pydantic import computed_field
 from pydantic_settings import BaseSettings
 import fitz as pymupdf
+import torch
 
 
 class Settings(BaseSettings):
     # General
     TORCH_DEVICE: str = "cpu"
     INFERENCE_RAM: int = 40  # How much VRAM each GPU has (in GB).
-    VRAM_PER_TASK: float = 2
+    VRAM_PER_TASK: float = 2  # How much VRAM to allocate per task (in GB). Peak marker VRAM usage is around 3GB, but avg across workers is lower.
     DEBUG: bool = False  # Enable debug logging
     DEFAULT_LANG: str = "English"  # Default language we assume files to be in, should be one of the keys in TESSERACT_LANGUAGES
 
@@ -73,10 +74,10 @@ class Settings(BaseSettings):
 
     # Final editing model
     EDITOR_BATCH_SIZE: int = 4
-    EDITOR_MAX_LENGTH: int =
+    EDITOR_MAX_LENGTH: int = 1024
     EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor_t5"
-    ENABLE_EDITOR_MODEL: bool =
-    EDITOR_CUTOFF_THRESH: float = 0.
+    ENABLE_EDITOR_MODEL: bool = False  # The editor model can create false positives
+    EDITOR_CUTOFF_THRESH: float = 0.9  # Ignore predictions below this probability
 
     # Ray
     RAY_CACHE_PATH: Optional[str] = None  # Where to save ray cache
@@ -88,6 +89,11 @@ class Settings(BaseSettings):
     def CUDA(self) -> bool:
         return "cuda" in self.TORCH_DEVICE
 
+    @computed_field
+    @property
+    def MODEL_DTYPE(self) -> torch.dtype:
+        return torch.bfloat16 if self.CUDA else torch.float32
+
     class Config:
         env_file = find_dotenv("local.env")
         extra = "ignore"
```
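The new `MODEL_DTYPE` computed field picks bfloat16 on CUDA and float32 everywhere else, replacing the per-file `if settings.CUDA: model.to(torch.bfloat16)` pattern. The selection logic, sketched without pydantic or torch (dtype names as strings here; the real field returns `torch.dtype` objects):

```python
def model_dtype(torch_device: str) -> str:
    """Sketch of Settings.MODEL_DTYPE: bfloat16 on CUDA devices, float32
    elsewhere (including MPS and CPU)."""
    cuda = "cuda" in torch_device  # mirrors the Settings.CUDA computed field
    return "bfloat16" if cuda else "float32"
```

Centralizing the dtype choice in settings is what lets every `from_pretrained(...)` call pass `torch_dtype=settings.MODEL_DTYPE` instead of casting after load.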