Vik Paruchuri committed
Commit a04b2ff · Parent(s): 48dfedd

Clean up benchmarking

Files changed:
- README.md +27 -18
- benchmark.py +5 -5
- marker/benchmark/scoring.py +16 -15
- marker/cleaners/equations.py +2 -2
- marker/ordering.py +2 -2
- marker/segmentation.py +2 -2
- marker/settings.py +4 -2
README.md
CHANGED

@@ -1,6 +1,6 @@
# Marker

-Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes

Features:

@@ -9,19 +9,21 @@ Features:
- Removal of headers/footers/other artifacts
- Latex conversion for most equations
- Proper code block and table formatting
-- Support for multiple languages (although most testing is done in English)
-- Works on GPU, CPU, or MPS

## How it works

Marker is a pipeline of steps and deep learning models:

-
-
-
- Postprocess extracted text

-Marker minimizes

## Limitations

@@ -46,12 +48,12 @@ This has been tested on Mac and Linux (Ubuntu).

- Install the system requirements at `install/apt-requirements.txt` for Linux or `install/brew-requirements.txt` for Mac
- Linux only: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/). You may get tesseract 4 otherwise.
-- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for
- Set the tesseract data folder path
- Find the tesseract data folder `tessdata`
- On mac, you can run `brew list tesseract`
-
-- Create a `local.env` file with `TESSDATA_PREFIX=/path/to/tessdata` inside it

**Python packages**

@@ -62,7 +64,8 @@ This has been tested on Mac and Linux (Ubuntu).
**Configuration**

- Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
-- If using GPU, set `
- Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.

## Convert a single file

@@ -73,7 +76,7 @@ Run `convert_single.py`, like this:
python convert_single.py /path/to/file.pdf /path/to/output.md --workers 4 --max_pages 10
```

-- `--workers` is the number of parallel processes to run. This is set to 1 by default, but you can increase it to speed up processing
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the `DEFAULT_LANG` setting is set correctly for this.

@@ -86,14 +89,14 @@ Run `convert.py`, like this:
python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --metadata_file /path/to/metadata.json
```

-- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:

```
{
  "pdf1.pdf": {"language": "English"},
-  "pdf2.pdf": {"language": "
  ...
}
```

@@ -107,15 +110,21 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=35 bash chunk_convert.sh
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
-- `NUM_DEVICES` is the number of GPUs to use. Should be 2 or greater.
-- `NUM_WORKERS` is the number of parallel processes to run on each GPU.

## Benchmark

-You can

```
python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
```

-This will benchmark marker against other text extraction methods.

# Marker

+Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat, works across many types of documents, and minimizes the risk of hallucinations significantly.

Features:

- Removal of headers/footers/other artifacts
- Latex conversion for most equations
- Proper code block and table formatting
+- Support for multiple languages (although most testing is done in English). See `settings.py` for a list of supported languages.
+- Works on GPU, CPU, or MPS

## How it works

Marker is a pipeline of steps and deep learning models:

+- Loop through each document page, and:
+  - OCR the page if text cannot be detected
+  - Detect page layout
+  - Format blocks properly based on layout
+- Combine text from all pages
- Postprocess extracted text

+Marker minimizes the use of autoregressive models, which reduces the risk of hallucinations to close to zero, and improves speed. The only parts of a document that are passed through an LLM forward pass are equation blocks.
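
The pipeline description above amounts to a single per-page loop followed by a merge step. A rough sketch of that flow (every helper here is a placeholder stub for illustration, not marker's real internals):

```
# Placeholder stubs standing in for marker's actual OCR, layout, formatting,
# and postprocessing code -- they exist only so this sketch runs end to end.
def has_detectable_text(page): return bool(page["text"])
def ocr_page(page): return {**page, "text": "<ocr output>"}
def detect_layout(page): return ["Text"]                  # block types per region
def format_blocks(page, layout): return page["text"]
def postprocess(text): return text.strip()

def convert_document(pages):
    formatted = []
    for page in pages:                                    # loop through each page
        if not has_detectable_text(page):                 # OCR only when no text layer exists
            page = ocr_page(page)
        layout = detect_layout(page)                      # detect page layout
        formatted.append(format_blocks(page, layout))     # format blocks based on layout
    combined = "\n\n".join(formatted)                     # combine text from all pages
    return postprocess(combined)                          # postprocess extracted text

print(convert_document([{"text": "Page one"}, {"text": ""}]))
```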

## Limitations


- Install the system requirements at `install/apt-requirements.txt` for Linux or `install/brew-requirements.txt` for Mac
- Linux only: Install tesseract 5 by following [these instructions](https://notesalexp.org/tesseract-ocr/html/). You may get tesseract 4 otherwise.
+- Linux only: Install ghostscript > 9.55 (see `install/ghostscript_install.sh` for the commands).
- Set the tesseract data folder path
- Find the tesseract data folder `tessdata`
- On mac, you can run `brew list tesseract`
+- On linux, run `find / -name tessdata`
+- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it

**Python packages**


**Configuration**

- Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
+- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
+- Depending on your document types, marker's average memory usage per task can vary. You can configure `VRAM_PER_TASK` to adjust this if you notice tasks failing with GPU out of memory errors.
- Inspect the settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
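
To make the two GPU settings above concrete, the division below mirrors the `INFERENCE_RAM / VRAM_PER_TASK` bound mentioned later in this README (the numbers are examples, not marker's defaults):

```
INFERENCE_RAM = 16   # GB of VRAM per GPU, as you would set it in local.env
VRAM_PER_TASK = 4    # rough GB one conversion task is expected to use (example value)

# Upper bound on conversions that can run in parallel on a single GPU
max_parallel_tasks = INFERENCE_RAM // VRAM_PER_TASK
print(max_parallel_tasks)  # 4 here; requesting more workers than this adds no GPU parallelism
```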

## Convert a single file

python convert_single.py /path/to/file.pdf /path/to/output.md --workers 4 --max_pages 10
```

+- `--workers` is the number of parallel CPU processes to run for OCR. This is set to 1 by default, but you can increase it to speed up processing.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the `DEFAULT_LANG` setting is set correctly for this.

python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --metadata_file /path/to/metadata.json
```

+- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is:

```
{
  "pdf1.pdf": {"language": "English"},
+  "pdf2.pdf": {"language": "Spanish"},
  ...
}
```
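
A small sketch of the lookup this format implies; `language_for_pdf` is a hypothetical helper (not a function in marker) showing the fallback to `DEFAULT_LANG` when a pdf has no metadata entry:

```
import json

DEFAULT_LANG = "English"  # mirrors the DEFAULT_LANG setting

def language_for_pdf(pdf_name, metadata_file=None):
    # Hypothetical helper: pick the per-pdf language, falling back to the default.
    if metadata_file is None:
        return DEFAULT_LANG
    with open(metadata_file) as f:
        metadata = json.load(f)
    return metadata.get(pdf_name, {}).get("language", DEFAULT_LANG)

print(language_for_pdf("pdf1.pdf"))  # "English" when no metadata file is passed
```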

```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
+- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
+- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
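
Putting the numbers from the example command together (illustrative values; the per-GPU cap is the same `INFERENCE_RAM / VRAM_PER_TASK` bound described in the configuration section):

```
NUM_DEVICES = 4        # GPUs, from the example command above
NUM_WORKERS = 35       # processes requested per GPU
INFERENCE_RAM = 16     # example GB of VRAM per GPU
VRAM_PER_TASK = 4      # example GB per conversion task

per_gpu = min(NUM_WORKERS, INFERENCE_RAM // VRAM_PER_TASK)
print(NUM_DEVICES * per_gpu)  # effective parallel conversions across all GPUs (16 here)
```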

## Benchmark

+You can benchmark the performance of marker on your machine. Run `benchmark.py`, like this:

```
python benchmark.py benchmark_data/pdfs benchmark_data/references report.json --nougat
```

+This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each (4GB).
+
+Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
+
+# Commercial usage
+
+Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage. I'm building a version that can be used commercially. If you would like to get early access, email me at marker@vikas.sh.

benchmark.py
CHANGED

@@ -23,11 +23,11 @@ from tabulate import tabulate
configure_logging()


-def nougat_prediction(pdf_filename, batch_size=
    out_dir = tempfile.mkdtemp()
    # No skipping avoids failure detection, so we attempt to convert the full doc
-    # Batch size
-    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--batchsize", str(batch_size)], check=True)
    md_file = os.listdir(out_dir)[0]
    with open(os.path.join(out_dir, md_file), "r") as f:
        data = f.read()

@@ -41,8 +41,8 @@ if __name__ == "__main__":
    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
    parser.add_argument("out_file", help="Output filename")
    parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
-    parser.add_argument("--nougat_batch_size", type=int, default=
-    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker processes to run in parallel")
    parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
    args = parser.parse_args()

configure_logging()


+def nougat_prediction(pdf_filename, batch_size=2):
    out_dir = tempfile.mkdtemp()
    # No skipping avoids failure detection, so we attempt to convert the full doc
+    # Batch size 2 is to match VRAM usage of marker
+    subprocess.run(["nougat", pdf_filename, "-o", out_dir, "--no-skipping", "--recompute", "--batchsize", str(batch_size)], check=True)
    md_file = os.listdir(out_dir)[0]
    with open(os.path.join(out_dir, md_file), "r") as f:
        data = f.read()

    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
    parser.add_argument("out_file", help="Output filename")
    parser.add_argument("--nougat", action="store_true", help="Run nougat and compare", default=False)
+    parser.add_argument("--nougat_batch_size", type=int, default=settings.NOUGAT_BATCH_SIZE, help="Batch size to use for nougat when making predictions.")
+    parser.add_argument("--marker_parallel", type=int, default=4, help="Number of marker CPU processes to run in parallel")
    parser.add_argument("--md_out_path", type=str, default=None, help="Output path for generated markdown files")
    args = parser.parse_args()
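
Given the updated signature of `nougat_prediction` above, a standalone call would look roughly like this (the pdf path is illustrative, and benchmark.py normally drives this itself):

```
from benchmark import nougat_prediction

# Converts one pdf with nougat and returns the generated markdown as a string.
md_text = nougat_prediction("benchmark_data/pdfs/example.pdf", batch_size=2)
print(md_text[:200])
```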
marker/benchmark/scoring.py
CHANGED

@@ -1,8 +1,9 @@
-
import re

-
-CHUNK_OVERLAP = 96


def tokenize(text):

@@ -14,14 +15,9 @@ def tokenize(text):
    return flattened_result


-def chunk_text(text
-
-    chunks = []
-    step = CHUNK_SIZE
-    if overlap:
-        step -= CHUNK_OVERLAP
-    for i in range(0, len(tokens), step):
-        chunks.append(''.join(tokens[i:i+CHUNK_SIZE]))
    return chunks


@@ -29,22 +25,27 @@ def overlap_score(hypothesis_chunks, reference_chunks):
    length_modifier = len(hypothesis_chunks) / len(reference_chunks)
    search_distance = max(len(reference_chunks) // 5, 10)
    chunk_scores = []
    for i, hyp_chunk in enumerate(hypothesis_chunks):
        max_score = 0
        i_offset = int(i * length_modifier)
        chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
        for j in chunk_range:
            ref_chunk = reference_chunks[j]
-            score = fuzz.ratio(hyp_chunk, ref_chunk)
            if score > max_score:
                max_score = score
        chunk_scores.append(max_score)
-


def score_text(hypothesis, reference):
    # Returns a 0-1 alignment score
    hypothesis_chunks = chunk_text(hypothesis)
    reference_chunks = chunk_text(reference)
-    chunk_scores = overlap_score(hypothesis_chunks, reference_chunks)
-    return sum(chunk_scores) /

+import math
+
+from rapidfuzz import fuzz, distance
import re

+CHUNK_MIN_CHARS = 25


def tokenize(text):

    return flattened_result


+def chunk_text(text):
+    chunks = text.split("\n")
+    chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
    return chunks


    length_modifier = len(hypothesis_chunks) / len(reference_chunks)
    search_distance = max(len(reference_chunks) // 5, 10)
    chunk_scores = []
+    chunk_weights = []
    for i, hyp_chunk in enumerate(hypothesis_chunks):
        max_score = 0
+        chunk_weight = 1
        i_offset = int(i * length_modifier)
        chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
        for j in chunk_range:
            ref_chunk = reference_chunks[j]
+            score = fuzz.ratio(hyp_chunk, ref_chunk) / 100
            if score > max_score:
                max_score = score
+                chunk_weight = math.sqrt(len(ref_chunk))
        chunk_scores.append(max_score)
+        chunk_weights.append(chunk_weight)
+    chunk_scores = [chunk_scores[i] * chunk_weights[i] for i in range(len(chunk_scores))]
+    return chunk_scores, chunk_weights


def score_text(hypothesis, reference):
    # Returns a 0-1 alignment score
    hypothesis_chunks = chunk_text(hypothesis)
    reference_chunks = chunk_text(reference)
+    chunk_scores, chunk_weights = overlap_score(hypothesis_chunks, reference_chunks)
+    return sum(chunk_scores) / sum(chunk_weights)
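
A quick usage sketch of the reworked scoring (the strings are made-up examples, and this assumes the marker package and rapidfuzz are importable): chunks now come from splitting on newlines with a 25-character minimum, and each chunk's best fuzzy match is weighted by the square root of the matching reference chunk's length before averaging.

```
from marker.benchmark.scoring import score_text

hypothesis = "Marker converts PDF, EPUB, and MOBI files into Markdown output."
reference = "Marker converts PDF, EPUB, and MOBI to Markdown. It is up to 10x faster than nougat."
print(round(score_text(hypothesis, reference), 3))  # a 0-1 alignment score
```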
marker/cleaners/equations.py
CHANGED

@@ -23,7 +23,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_nougat_model():
-    ckpt = get_checkpoint(None, model_tag=
    nougat_model = NougatModel.from_pretrained(ckpt)
    if settings.TORCH_DEVICE != "cpu":
        move_to_device(nougat_model, bf16=settings.CUDA, cuda=settings.CUDA)

@@ -94,7 +94,7 @@ def get_nougat_text_batched(images, reformat_region_lens, nougat_model):
        max_length += settings.NOUGAT_TOKEN_BUFFER

        nougat_model.config.max_length = max_length
-        model_output = nougat_model.inference(image_tensors=sample)
        for j, output in enumerate(model_output["predictions"]):
            disclaimer = ""
            token_count = get_total_nougat_tokens(output, nougat_model)


def load_nougat_model():
+    ckpt = get_checkpoint(None, model_tag=settings.NOUGAT_MODEL_NAME)
    nougat_model = NougatModel.from_pretrained(ckpt)
    if settings.TORCH_DEVICE != "cpu":
        move_to_device(nougat_model, bf16=settings.CUDA, cuda=settings.CUDA)

        max_length += settings.NOUGAT_TOKEN_BUFFER

        nougat_model.config.max_length = max_length
+        model_output = nougat_model.inference(image_tensors=sample, early_stopping=False)
        for j, output in enumerate(model_output["predictions"]):
            disclaimer = ""
            token_count = get_total_nougat_tokens(output, nougat_model)

marker/ordering.py
CHANGED

@@ -12,11 +12,11 @@ import io
from marker.schema import Page
from marker.settings import settings

-processor = LayoutLMv3Processor.from_pretrained(


def load_ordering_model():
-    model = LayoutLMv3ForSequenceClassification.from_pretrained(
    if settings.CUDA:
        model = model.to(torch.bfloat16)
    return model

from marker.schema import Page
from marker.settings import settings

+processor = LayoutLMv3Processor.from_pretrained(settings.ORDERER_MODEL_NAME)


def load_ordering_model():
+    model = LayoutLMv3ForSequenceClassification.from_pretrained(settings.ORDERER_MODEL_NAME).to(settings.TORCH_DEVICE)
    if settings.CUDA:
        model = model.to(torch.bfloat16)
    return model

marker/segmentation.py
CHANGED

@@ -17,14 +17,14 @@ from math import isclose
# Otherwise some images can be truncated
Image.MAX_IMAGE_PIXELS = None

-processor = LayoutLMv3Processor.from_pretrained(

CHUNK_KEYS = ["input_ids", "attention_mask", "bbox", "offset_mapping"]
NO_CHUNK_KEYS = ["pixel_values"]


def load_layout_model():
-    model = LayoutLMv3ForTokenClassification.from_pretrained(
    if settings.CUDA:
        model = model.to(torch.bfloat16)

# Otherwise some images can be truncated
Image.MAX_IMAGE_PIXELS = None

+processor = LayoutLMv3Processor.from_pretrained(settings.LAYOUT_MODEL_NAME, apply_ocr=False)

CHUNK_KEYS = ["input_ids", "attention_mask", "bbox", "offset_mapping"]
NO_CHUNK_KEYS = ["pixel_values"]


def load_layout_model():
+    model = LayoutLMv3ForTokenClassification.from_pretrained(settings.LAYOUT_MODEL_NAME).to(settings.TORCH_DEVICE)
    if settings.CUDA:
        model = model.to(torch.bfloat16)
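
With these changes, both LayoutLMv3 loaders read their checkpoint names from settings rather than hardcoded strings. A usage sketch (assumes the `vikp/layout_segmenter` and `vikp/column_detector` checkpoints can be downloaded):

```
from marker.ordering import load_ordering_model
from marker.segmentation import load_layout_model

order_model = load_ordering_model()   # ORDERER_MODEL_NAME, moved to settings.TORCH_DEVICE
layout_model = load_layout_model()    # LAYOUT_MODEL_NAME, moved to settings.TORCH_DEVICE
```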
marker/settings.py
CHANGED

@@ -53,17 +53,19 @@ class Settings(BaseSettings):
    NOUGAT_HALLUCINATION_WORDS: List[str] = ["[MISSING_PAGE_POST]", "## References\n", "**Figure Captions**\n", "Footnote",
                                             "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]"]
    NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
-    NOUGAT_MODEL_NAME: str = "
-    NOUGAT_BATCH_SIZE: int = 4

    # Layout Model
    BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
    LAYOUT_MODEL_MAX: int = 512
    LAYOUT_CHUNK_OVERLAP: int = 64
    LAYOUT_DPI: int = 96

    # Ordering model
    ORDERER_BATCH_SIZE: int = 16 # This can be high, because max token count is 128

    # Ray
    RAY_CACHE_PATH: Optional[str] = None # Where to save ray cache

    NOUGAT_HALLUCINATION_WORDS: List[str] = ["[MISSING_PAGE_POST]", "## References\n", "**Figure Captions**\n", "Footnote",
                                             "\par\par\par", "## Chapter", "Fig.", "particle", "[REPEATS]", "[TRUNCATED]"]
    NOUGAT_DPI: int = 96 # DPI to render images at, matches default settings for nougat
+    NOUGAT_MODEL_NAME: str = "0.1.0-small" # Name of the model to use
+    NOUGAT_BATCH_SIZE: int = 4 if TORCH_DEVICE == "cuda" else 1 # Batch size for nougat, don't batch on cpu

    # Layout Model
    BAD_SPAN_TYPES: List[str] = ["Caption", "Footnote", "Page-footer", "Page-header", "Picture"]
    LAYOUT_MODEL_MAX: int = 512
    LAYOUT_CHUNK_OVERLAP: int = 64
    LAYOUT_DPI: int = 96
+    LAYOUT_MODEL_NAME: str = "vikp/layout_segmenter"

    # Ordering model
    ORDERER_BATCH_SIZE: int = 16 # This can be high, because max token count is 128
+    ORDERER_MODEL_NAME: str = "vikp/column_detector"

    # Ray
    RAY_CACHE_PATH: Optional[str] = None # Where to save ray cache
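
The README notes that any of these settings can be overridden via `local.env` or environment variables. A minimal sketch of an environment-variable override, assuming the `Settings` instance is created when `marker.settings` is imported (typical for pydantic `BaseSettings`; the exact loading mechanics are not shown in this diff):

```
import os

# Set the override before importing, so the Settings instance reads it from the environment.
os.environ["ORDERER_BATCH_SIZE"] = "8"

from marker.settings import settings
print(settings.ORDERER_BATCH_SIZE)  # 8 instead of the default 16
```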