Vik Paruchuri committed on
Commit 4009b03 · 1 Parent(s): 60e9de5

Bump package version

Files changed (3)
  1. README.md +18 -35
  2. docs/install_ocrmypdf.md +29 -0
  3. pyproject.toml +1 -1
README.md CHANGED
@@ -30,7 +30,6 @@ It only uses models where necessary, which improves speed and accuracy.
30
  | [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
31
  | [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |
32
 
33
-
34
  ## Performance
35
 
36
  ![Benchmark overall](data/images/overall.png)
@@ -39,6 +38,12 @@ The above results are with marker and nougat setup so they each take ~4GB of VRA
39
 
40
  See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
41
 
42
  # Community
43
 
44
  [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
@@ -48,13 +53,14 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc
48
  PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
49
 
50
  - Marker will not convert 100% of equations to LaTeX. This is because it has to detect then convert.
 
51
  - Whitespace and indentations are not always respected.
52
  - Not all lines/spans will be joined properly.
53
  - This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
54
 
55
  # Installation
56
 
57
- This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
58
 
59
  Install with:
60
 
@@ -62,32 +68,15 @@ Install with:
62
  pip install marker-pdf
63
  ```
64
 
65
- ## Optional
66
-
67
- Only needed if using `ocrmypdf` as the ocr backend.
68
-
69
- **Linux**
70
 
71
- - Run `pip install ocrmypdf`
72
- - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
73
- - Install other requirements with `cat scripts/install/tess-apt-requirements.txt | xargs sudo apt-get install -y`
74
- - Set the tesseract data folder path
75
- - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
76
- - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
77
 
78
- **Mac**
79
-
80
- Only needed if using `ocrmypdf` as the ocr backend.
81
-
82
- - Run `pip install ocrmypdf`
83
- - Install system requirements from `scripts/install/tess-brew-requirements.txt`
84
- - Set the tesseract data folder path
85
- - Find the tesseract data folder `tessdata` with `brew list tesseract`
86
- - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
87
 
88
  # Usage
89
 
90
- First, some configuration. Note that settings can be overridden with env vars.
91
 
92
  - Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
93
  - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
@@ -98,7 +87,7 @@ First, some configuration. Note that settings can be overridden with env vars.
98
  ## Convert a single file
99
 
100
  ```shell
101
- marker_single /path/to/file.pdf /path/to/output/folder --parallel_factor 2 --max_pages 10 --langs English
102
  ```
103
 
104
  - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
@@ -141,16 +130,18 @@ MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 mar
141
 
142
  Note that the env variables above are specific to this script, and cannot be set in `local.env`.
143
 
144
- # Important settings/Troubleshooting
145
 
146
- There are some settings that you may find especially useful if things aren't working the way you expect:
147
 
148
  - `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
149
  - `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
150
  - `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
151
  - `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
 
 
152
 
153
- In general, if output is not what you expect, trying to OCR the PDF is a good first step.
154
 
155
  # Benchmarks
156
 
@@ -201,14 +192,6 @@ This will benchmark marker against other text extraction methods. It sets up ba
201
 
202
  Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
203
 
204
- # Commercial usage
205
-
206
- All models were trained from scratch, so they're okay for commercial usage. The weights for the models are licensed cc-by-nc-sa-4.0, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised.
207
-
208
- If you want to remove the GPL license requirements for inference or use the weights commercially over the revenue limit, please contact me at marker@vikas.sh for dual licensing.
209
-
210
- Note that the `ocrmypdf` OCR option will use ocrmypdf, which includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions. Ocrmypdf is disabled by default, and will not be installed automatically.
211
-
212
  # Thanks
213
 
214
  This work would not have been possible without amazing open source models and datasets, including (but not limited to):
 
30
  | [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
31
  | [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |
32
 
 
33
  ## Performance
34
 
35
  ![Benchmark overall](data/images/overall.png)
 
38
 
39
  See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
40
 
41
+ # Commercial usage
42
+
43
+ I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
44
+
45
+ The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
46
+
47
  # Community
48
 
49
  [Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
 
53
  PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
54
 
55
  - Marker will not convert 100% of equations to LaTeX. This is because it has to detect then convert.
56
+ - Tables are not always formatted 100% correctly - text can be in the wrong column.
57
  - Whitespace and indentations are not always respected.
58
  - Not all lines/spans will be joined properly.
59
  - This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
60
 
61
  # Installation
62
 
63
+ You'll need python 3.9+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.
64
 
65
  Install with:
66
 
 
68
  pip install marker-pdf
69
  ```
70
 
71
+ ## Optional: OCRMyPDF
72
 
73
+ Only needed if you want to use the optional `ocrmypdf` as the ocr backend. Note that `ocrmypdf` includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.
74
 
75
+ See the instructions [here](docs/install_ocrmypdf.md)
76
 
77
  # Usage
78
 
79
+ First, some configuration:
80
 
81
  - Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
82
  - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
 
87
  ## Convert a single file
88
 
89
  ```shell
90
+ marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English
91
  ```
92
 
93
  - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
 
130
 
131
  Note that the env variables above are specific to this script, and cannot be set in `local.env`.
132
 
133
+ # Troubleshooting
134
 
135
+ There are some settings that you may find useful if things aren't working the way you expect:
136
 
137
  - `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
138
  - `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
139
  - `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
140
  - `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
141
+ - Verify that you set the languages correctly, or passed in a metadata file.
142
+ - If you're getting out of memory errors, decrease worker count (increase the `VRAM_PER_TASK` setting). You can also try splitting up long PDFs into multiple files.
143
 
144
+ In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.
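A minimal sketch of how the troubleshooting settings above could be applied to a single run, assuming that variables exported in the shell are picked up as overrides (as the Usage section states). `PDF` and `OUT_DIR` are hypothetical variables introduced only for this illustration, and `TORCH_DEVICE=cpu` is an assumed value chosen so the sketch does not require a GPU:

```shell
#!/bin/sh
# Setting names come from the README; the values are illustrative.
export TORCH_DEVICE=cpu     # assumption: the README's own example uses cuda
export OCR_ALL_PAGES=true   # force OCR on every page
export OCR_ENGINE=surya     # or ocrmypdf, if it is installed

# PDF and OUT_DIR are hypothetical variables, not marker settings.
if command -v marker_single >/dev/null 2>&1 && [ -f "${PDF:-}" ]; then
  marker_single "$PDF" "${OUT_DIR:-./output}" --batch_multiplier 2 --max_pages 10 --langs English
else
  echo "skipping conversion: marker_single not installed or PDF not set"
fi
```

Exporting the variables keeps the overrides local to the current shell session, so the defaults in `marker/settings.py` stay untouched.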
145
 
146
  # Benchmarks
147
 
 
192
 
193
  Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
194
 
195
  # Thanks
196
 
197
  This work would not have been possible without amazing open source models and datasets, including (but not limited to):
docs/install_ocrmypdf.md ADDED
@@ -0,0 +1,29 @@
1
+ ## Linux
2
+
3
+ - Run `apt-get install ocrmypdf`
4
+ - Install ghostscript > 9.55 by following [these instructions](https://ghostscript.readthedocs.io/en/latest/Install.html) or running `scripts/install/ghostscript_install.sh`.
5
+ - Run `pip install ocrmypdf`
6
+ - Install any tesseract language packages that you want (example `apt-get install tesseract-ocr-eng`)
7
+ - Set the tesseract data folder path
8
+ - Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
9
+ - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
10
+
11
+ ## Mac
12
+
13
+ Only needed if using `ocrmypdf` as the ocr backend.
14
+
15
+ - Run `brew install ocrmypdf`
16
+ - Run `brew install tesseract-lang` to add language support
17
+ - Run `pip install ocrmypdf`
18
+ - Set the tesseract data folder path
19
+ - Find the tesseract data folder `tessdata` with `brew list tesseract`
20
+ - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
21
+
22
+ ## Windows
23
+
24
+ - Install `ocrmypdf` and ghostscript by following [these instructions](https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-on-windows)
25
+ - Run `pip install ocrmypdf`
26
+ - Install any tesseract language packages you want
27
+ - Set the tesseract data folder path
28
+ - Find the tesseract data folder `tessdata`; on Windows it is typically inside the Tesseract installation directory
29
+ - Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it
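The last two steps in the Linux and Mac sections above (locate `tessdata`, then write `TESSDATA_PREFIX` into `local.env`) can be sketched as a small shell function. `write_tessdata_env` is a hypothetical helper, not part of marker, and picking the last match in sorted order is an assumption; with several tesseract installs, check by hand that the chosen folder belongs to the newest version:

```shell
# Hypothetical helper (not part of marker): record the tessdata folder
# in an env file.
#   $1 = directory to search
#   $2 = env file to write (e.g. local.env in the root marker folder)
write_tessdata_env() {
  search_root="$1"
  env_file="$2"
  # Sort so the choice is deterministic; the last match is an assumption,
  # not a guarantee that it belongs to the newest tesseract.
  tessdata_dir="$(find "$search_root" -type d -name tessdata 2>/dev/null | sort | tail -n 1)"
  if [ -z "$tessdata_dir" ]; then
    echo "no tessdata folder found under $search_root" >&2
    return 1
  fi
  printf 'TESSDATA_PREFIX=%s\n' "$tessdata_dir" > "$env_file"
}
```

For example, `write_tessdata_env /usr local.env` run from the root `marker` folder would search `/usr` rather than the whole filesystem, which is usually much faster than `find / -name tessdata`.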
pyproject.toml CHANGED
@@ -1,6 +1,6 @@
1
  [tool.poetry]
2
  name = "marker-pdf"
3
- version = "0.2.4"
4
  description = "Convert PDF to markdown with high speed and accuracy."
5
  authors = ["Vik Paruchuri <github@vikas.sh>"]
6
  readme = "README.md"
 
1
  [tool.poetry]
2
  name = "marker-pdf"
3
+ version = "0.2.5"
4
  description = "Convert PDF to markdown with high speed and accuracy."
5
  authors = ["Vik Paruchuri <github@vikas.sh>"]
6
  readme = "README.md"