Vik Paruchuri committed on
Commit 6fb08e1 · 1 Parent(s): ed65502

Fix README

Files changed (2)
  1. README.md +9 -19
  2. marker/converters/ocr.py +0 -1
README.md CHANGED
@@ -79,7 +79,7 @@ pip install marker-pdf[full]
 First, some configuration:
 
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
-- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag to ensure your PDF runs through OCR, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
+- Some PDFs, even digital ones, have bad text in them. Set the `format_lines` flag to ensure the bad lines are fixed and formatted. You can also set `--force_ocr` to force OCR on all lines, or the `strip_existing_ocr` to keep all digital text, and strip out any existing OCR text.
 
 ## Interactive App
 
@@ -99,15 +99,16 @@ marker_single /path/to/file.pdf
 You can pass in PDFs or images.
 
 Options:
-- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
+- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--output_format [markdown|json|html]`: Specify the format for the output results.
+- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
-- `--use_llm`: Uses an LLM to improve accuracy. You must set your Gemini API key using the `GOOGLE_API_KEY` env var.
-- `--redo_inline_math`: If you want the highest quality inline math conversion, use this along with `--use_llm`.
-- `--disable_image_extraction`: Don't extract images from the PDF. If you also specify `--use_llm`, then images will be replaced with a description.
-- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
+- `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
+- `--format_lines`: Reformat all lines using a local OCR model (inline math, underlines, bold, etc.). This will give very good quality math output.
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
 - `--strip_existing_ocr`: Remove all existing OCR text in the document and re-OCR with surya.
+- `--redo_inline_math`: If you want the absolute highest quality inline math conversion, use this along with `--use_llm`.
+- `--disable_image_extraction`: Don't extract images from the PDF. If you also specify `--use_llm`, then images will be replaced with a description.
 - `--debug`: Enable debug mode for additional logging and diagnostic information.
 - `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
 - `--config_json PATH`: Path to a JSON configuration file containing additional settings.
@@ -229,7 +230,7 @@ marker_single FILENAME --use_llm --force_layout_block Table --converter_cls mark
 
 ### OCR Only
 
-If you only want to run OCR, you can also do that through the `OCRConverter`.
+If you only want to run OCR, you can also do that through the `OCRConverter`. Set `--keep_chars` to keep individual characters and bounding boxes. You can also set `--force_ocr` and `--format_lines` with this converter.
 
 ```python
 from marker.converters.ocr import OCRConverter
@@ -520,15 +521,4 @@ PDF is a tricky format, so marker will not always work perfectly. Here are some
 - Very complex layouts, with nested tables and forms, may not work
 - Forms may not be rendered well
 
-Note: Passing the `--use_llm` flag will mostly solve these issues.
-
-# Thanks
-
-This work would not have been possible without amazing open source models and datasets, including (but not limited to):
-
-- Surya
-- Texify
-- Pypdfium2/pdfium
-- DocLayNet from IBM
-
-Thank you to the authors of these models and datasets for making them available to the community!
+Note: Passing the `--use_llm` and `--format_lines` flags will mostly solve these issues.
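
For readers checking the `--page_range` wording in the options list of this README diff, here is a minimal sketch of how a spec such as `"0,5-10,20"` could expand to "pages 0, 5 through 10, and page 20". The function name `parse_page_range` is hypothetical and this is not marker's own parser; it only illustrates the documented syntax, assuming ranges are inclusive as the README example states.

```python
def parse_page_range(spec: str) -> list:
    """Expand a spec like "0,5-10,20" into a sorted list of page numbers.

    Hypothetical helper for illustration; not taken from marker.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```

The set removes duplicates when a single page also falls inside a range, for example `"5,5-7"`.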
 
 
 
 
 
 
 
 
 
 
 
marker/converters/ocr.py CHANGED
@@ -20,7 +20,6 @@ class OCRConverter(PdfConverter):
 self.config = {}
 
 self.config["format_lines"] = True
-self.config["keep_chars"] = True
 self.renderer = OCRJSONRenderer
 
 def build_document(self, filepath: str):
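
Read together with the README change, this hunk suggests that `keep_chars` moves from being forced on by the converter to being something the caller sets. Below is a minimal sketch of that difference. `build_ocr_config` is a hypothetical stand-in, not marker's code, and the way user config is merged is an assumption for illustration only.

```python
from typing import Optional


def build_ocr_config(user_config: Optional[dict] = None,
                     force_keep_chars: bool = False) -> dict:
    """Hypothetical stand-in for the config setup shown in the hunk."""
    config = dict(user_config or {})
    config["format_lines"] = True  # still set unconditionally after this commit
    if force_keep_chars:
        config["keep_chars"] = True  # models the line this commit removes
    return config


before = build_ocr_config(force_keep_chars=True)        # pre-commit behavior
after_default = build_ocr_config()                      # post-commit, no opt-in
after_opt_in = build_ocr_config({"keep_chars": True})   # post-commit, caller opts in

print("keep_chars" in before)         # True
print("keep_chars" in after_default)  # False
print(after_opt_in["keep_chars"])     # True
```

Under these assumptions, callers who relied on per-character output would need to request it explicitly after this commit.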