Vik Paruchuri committed · Commit 6fb08e1 · 1 Parent(s): ed65502
Fix README
Files changed:
- README.md +9 -19
- marker/converters/ocr.py +0 -1
README.md
CHANGED
|
@@ -79,7 +79,7 @@ pip install marker-pdf[full]
|
|
| 79 |
First, some configuration:
|
| 80 |
|
| 81 |
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
|
| 82 |
-
- Some PDFs, even digital ones, have bad text in them. Set the `
|
| 83 |
|
| 84 |
## Interactive App
|
| 85 |
|
|
@@ -99,15 +99,16 @@ marker_single /path/to/file.pdf
|
|
| 99 |
You can pass in PDFs or images.
|
| 100 |
|
| 101 |
Options:
|
| 102 |
-
- `--
|
| 103 |
- `--output_format [markdown|json|html]`: Specify the format for the output results.
|
|
|
|
| 104 |
- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
|
| 105 |
-
- `--use_llm`: Uses an LLM to improve accuracy. You
|
| 106 |
-
- `--
|
| 107 |
-
- `--disable_image_extraction`: Don't extract images from the PDF. If you also specify `--use_llm`, then images will be replaced with a description.
|
| 108 |
-
- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
|
| 109 |
- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
|
| 110 |
- `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
|
|
|
|
|
|
|
| 111 |
- `--debug`: Enable debug mode for additional logging and diagnostic information.
|
| 112 |
- `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
|
| 113 |
- `--config_json PATH`: Path to a JSON configuration file containing additional settings.
|
|
@@ -229,7 +230,7 @@ marker_single FILENAME --use_llm --force_layout_block Table --converter_cls mark
|
|
| 229 |
|
| 230 |
### OCR Only
|
| 231 |
|
| 232 |
-
If you only want to run OCR, you can also do that through the `OCRConverter`.
|
| 233 |
|
| 234 |
```python
|
| 235 |
from marker.converters.ocr import OCRConverter
|
|
@@ -520,15 +521,4 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
|
|
| 520 |
- Very complex layouts, with nested tables and forms, may not work
|
| 521 |
- Forms may not be rendered well
|
| 522 |
|
| 523 |
-
Note: Passing the `--use_llm`
|
| 524 |
-
|
| 525 |
-
# Thanks
|
| 526 |
-
|
| 527 |
-
This work would not have been possible without amazing open source models and datasets, including (but not limited to):
|
| 528 |
-
|
| 529 |
-
- Surya
|
| 530 |
-
- Texify
|
| 531 |
-
- Pypdfium2/pdfium
|
| 532 |
-
- DocLayNet from IBM
|
| 533 |
-
|
| 534 |
-
Thank you to the authors of these models and datasets for making them available to the community!
|
|
|
|
| 79 |
First, some configuration:
|
| 80 |
|
| 81 |
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
|
| 82 |
+
- Some PDFs, even digital ones, have bad text in them. Set the `--format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or `--strip_existing_ocr` to keep all digital text and strip out any existing OCR text.
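When marker is driven from a Python script rather than the shell, the device override mentioned above can be set in the process environment. This is a minimal sketch; it assumes marker reads `TORCH_DEVICE` from the environment when its settings are first imported, which the README text implies but does not spell out.

```python
import os

# Assumption: marker picks up TORCH_DEVICE from the environment at import time,
# so the variable has to be set before any marker module is imported.
os.environ["TORCH_DEVICE"] = "cuda"

# Import marker modules only after this point.
print(os.environ["TORCH_DEVICE"])
```

From a shell, prefixing the command with `TORCH_DEVICE=cuda` should have the same effect.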
|
| 83 |
|
| 84 |
## Interactive App
|
| 85 |
|
|
|
|
| 99 |
You can pass in PDFs or images.
|
| 100 |
|
| 101 |
Options:
|
| 102 |
+
- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
|
| 103 |
- `--output_format [markdown|json|html]`: Specify the format for the output results.
|
| 104 |
+
- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
|
| 105 |
- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
|
| 106 |
+
- `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
|
| 107 |
+
- `--format_lines`: Reformat all lines using a local OCR model (inline math, underlines, bold, etc.). This will give very good quality math output.
|
|
|
|
|
|
|
| 108 |
- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
|
| 109 |
- `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
|
| 110 |
+
- `--redo_inline_math`: If you want the absolute highest quality inline math conversion, use this along with `--use_llm`.
|
| 111 |
+
- `--disable_image_extraction`: Don't extract images from the PDF. If you also specify `--use_llm`, then images will be replaced with a description.
|
| 112 |
- `--debug`: Enable debug mode for additional logging and diagnostic information.
|
| 113 |
- `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
|
| 114 |
- `--config_json PATH`: Path to a JSON configuration file containing additional settings.
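Two of the options above define small text formats: the `--page_range` spec and the `--paginate_output` separator. The sketch below illustrates both under stated assumptions. `parse_page_range` and `split_paginated` are hypothetical helpers, not part of marker. The splitter assumes each separator precedes the content of its page, and it tolerates the page number appearing with or without literal braces, since the option text does not say which.

```python
import re
from typing import Dict, List

# "\n\n", the page number (optionally wrapped in literal braces),
# exactly 48 hyphens, then "\n\n".
PAGE_BREAK = re.compile(r"\n\n\{?(\d+)\}?-{48}\n\n")


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


def split_paginated(text: str) -> Dict[int, str]:
    """Map page number to page text (assumes the separator precedes each page)."""
    # re.split with one capturing group yields:
    # [text before first separator, number, content, number, content, ...]
    parts = PAGE_BREAK.split(text)
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


sample = "\n\n{0}" + "-" * 48 + "\n\nfirst\n\n{1}" + "-" * 48 + "\n\nsecond"
print(parse_page_range("0,5-10,20"))
print(split_paginated(sample))
```

The page numbers here follow the README example, which treats page 0 as a valid page.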
|
|
|
|
| 230 |
|
| 231 |
### OCR Only
|
| 232 |
|
| 233 |
+
If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes. You can also set `--force_ocr` and `--format_lines` with this converter.
|
| 234 |
|
| 235 |
```python
|
| 236 |
from marker.converters.ocr import OCRConverter
|
|
|
|
| 521 |
- Very complex layouts, with nested tables and forms, may not work
|
| 522 |
- Forms may not be rendered well
|
| 523 |
|
| 524 |
+
Note: Passing the `--use_llm` and `--format_lines` flags will mostly solve these issues.
|
|
marker/converters/ocr.py
CHANGED
|
@@ -20,7 +20,6 @@ class OCRConverter(PdfConverter):
|
|
| 20 |
self.config = {}
|
| 21 |
|
| 22 |
self.config["format_lines"] = True
|
| 23 |
-
self.config["keep_chars"] = True
|
| 24 |
self.renderer = OCRJSONRenderer
|
| 25 |
|
| 26 |
def build_document(self, filepath: str):
|
|
|
|
| 20 |
self.config = {}
|
| 21 |
|
| 22 |
self.config["format_lines"] = True
|
|
|
|
| 23 |
self.renderer = OCRJSONRenderer
|
| 24 |
|
| 25 |
def build_document(self, filepath: str):
|
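The effect of this hunk is that `keep_chars` is no longer forced on by the converter, while `format_lines` still is. The sketch below mirrors only that config handling. `SketchOCRConverter` is a hypothetical stand-in written for illustration, not marker's `OCRConverter`, and it assumes the caller supplies the config as a plain dict.

```python
from typing import Any, Dict, Optional


class SketchOCRConverter:
    """Hypothetical stand-in that mirrors the config handling shown in the diff."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        # Start from the caller's config, or an empty dict if none was given.
        self.config = dict(config) if config else {}
        # Still forced on after this commit.
        self.config["format_lines"] = True
        # "keep_chars" is no longer set here, so it keeps whatever the caller passed.


default = SketchOCRConverter()
opted_in = SketchOCRConverter({"keep_chars": True})
print(default.config)
print(opted_in.config)
```

This matches the README change in the same commit, which tells users to set `--keep_chars` themselves when they want individual characters and bounding boxes.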