Vik Paruchuri committed on
Commit · e006e5a
1 Parent(s): 0e97894
Add documentation for LLM mode
README.md CHANGED
@@ -9,6 +9,7 @@ Marker converts PDFs to markdown, JSON, and HTML quickly and accurately.
 - Extracts and saves images along with the markdown
 - Converts equations to latex
 - Easily extensible with your own formatting and logic
+- Optionally boost accuracy with an LLM
 - Works on GPU, CPU, or MPS
 
 ## How it works
@@ -99,10 +100,11 @@ marker_single /path/to/file.pdf
 
 Options:
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
-- `--debug`: Enable debug mode for additional logging and diagnostic information.
 - `--output_format [markdown|json|html]`: Specify the format for the output results.
+- `--use_llm`: Uses an LLM to improve accuracy. You must set your Gemini API key using the `GOOGLE_API_KEY` env var.
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
+- `--debug`: Enable debug mode for additional logging and diagnostic information.
 - `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
 - `--config_json PATH`: Path to a JSON configuration file containing additional settings.
 - `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "eng,fra,deu"` for English, French, and German.
@@ -127,7 +129,6 @@ NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 
 - `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
 - `NUM_WORKERS` is the number of parallel processes to run on each GPU.
--
 
 ## Use from python
 
@@ -332,6 +333,7 @@ Note that this is not a very robust API, and is only intended for small-scale us
 
 There are some settings that you may find useful if things aren't working the way you expect:
 
+- If you have issues with accuracy, try setting `--use_llm` to use an LLM to improve quality. You must set `GOOGLE_API_KEY` to a Gemini API key for this to work.
 - Make sure to set `force_ocr` if you see garbled text - this will re-OCR the document.
 - `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
 - If you're getting out of memory errors, decrease worker count. You can also try splitting up long PDFs into multiple files.
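The new `--use_llm` documentation says the flag only works when `GOOGLE_API_KEY` holds a Gemini API key. A minimal sketch of a wrapper that checks this before invoking `marker_single` might look like the following. The helper name `build_marker_command` and its parameters are hypothetical and not part of marker; only the `marker_single` command, the `--use_llm` and `--page_range` flags, and the `GOOGLE_API_KEY` variable come from the README text in this commit.

```python
import os
import shlex


def build_marker_command(pdf_path, use_llm=False, page_range=None, env=None):
    """Build the argv list for a marker_single run (hypothetical helper)."""
    # Allow an explicit mapping for testing; default to the real environment.
    env = os.environ if env is None else env
    cmd = ["marker_single", pdf_path]
    if use_llm:
        # Per the README, --use_llm needs a Gemini API key in GOOGLE_API_KEY.
        if not env.get("GOOGLE_API_KEY"):
            raise RuntimeError("Set GOOGLE_API_KEY before passing --use_llm")
        cmd.append("--use_llm")
    if page_range:
        # Passed through unchanged, e.g. "0,5-10,20".
        cmd += ["--page_range", page_range]
    return cmd


if __name__ == "__main__":
    # Placeholder key for illustration only; nothing is executed here.
    demo_env = {"GOOGLE_API_KEY": "example-key"}
    cmd = build_marker_command(
        "/path/to/file.pdf", use_llm=True, page_range="0,5-10,20", env=demo_env
    )
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where marker is installed. Failing early on a missing key avoids starting a conversion that cannot use the LLM step.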