u-ashish committed on
Commit cf5101a · 1 Parent(s): 0437a20

Update README

  1. README.md +11 -9
README.md CHANGED
@@ -11,6 +11,8 @@ Marker converts documents to markdown, JSON, chunks, and HTML quickly and accura
11
  - Optionally boost accuracy with LLMs (and your own prompt)
12
  - Works on GPU, CPU, or MPS
13
 
 
 
14
  ## Performance
15
 
16
  <img src="data/images/overall.png" width="800px"/>
@@ -41,11 +43,11 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al
41
 
42
  # Commercial usage
43
 
44
- Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to).
45
 
46
  # Hosted API
47
 
48
- There's a hosted API for marker available [here](https://www.datalab.to/):
49
 
50
  - Supports PDF, image, PPT, PPTX, DOC, DOCX, XLS, XLSX, HTML, EPUB files
51
  - 1/4th the price of leading cloud-based competitors
@@ -102,7 +104,7 @@ Options:
102
  - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
103
  - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
104
  - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
105
- - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
106
  - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
107
  - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
108
  - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
@@ -182,7 +184,7 @@ rendered = converter("FILEPATH")
182
 
183
  ### Extract blocks
184
 
185
- Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
186
 
187
  Here's an example of extracting all forms from a document:
188
 
@@ -222,7 +224,7 @@ text, _, images = text_from_rendered(rendered)
222
 
223
  This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.
224
 
225
- You can also run this via the CLI with
226
  ```shell
227
  marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
228
  ```
@@ -243,7 +245,7 @@ rendered = converter("FILEPATH")
243
 
244
  This takes all the same configuration as the PdfConverter.
245
 
246
- You can also run this via the CLI with
247
  ```shell
248
  marker_single FILENAME --converter_cls marker.converters.ocr.OCRConverter
249
  ```
@@ -260,7 +262,7 @@ from pydantic import BaseModel
260
 
261
  class Links(BaseModel):
262
      links: list[str]
263
-
264
  schema = Links.model_json_schema()
265
  config_parser = ConfigParser({
266
      "page_schema": schema
@@ -300,7 +302,7 @@ HTML output is similar to markdown output:
300
 
301
  JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.
302
 
303
- The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
304
 
305
  Pages have the keys:
306
 
@@ -366,7 +368,7 @@ All output formats will return a metadata dictionary, with the following fields:
366
  ], // computed PDF table of contents
367
  "page_stats": [
368
  {
369
- "page_id": 0,
370
  "text_extraction_method": "pdftext",
371
  "block_counts": [("Span", 200), ...]
372
  },
 
11
  - Optionally boost accuracy with LLMs (and your own prompt)
12
  - Works on GPU, CPU, or MPS
13
 
14
+ For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
15
+
16
  ## Performance
17
 
18
  <img src="data/images/overall.png" width="800px"/>
 
43
 
44
  # Commercial usage
45
 
46
+ Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to?utm_source=gh-marker).
47
 
48
  # Hosted API
49
 
50
+ There's a hosted API for marker available [here](https://www.datalab.to?utm_source=gh-marker):
51
 
52
  - Supports PDF, image, PPT, PPTX, DOC, DOCX, XLS, XLSX, HTML, EPUB files
53
  - 1/4th the price of leading cloud-based competitors
 
104
  - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
105
  - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
106
  - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
107
+ - `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
108
  - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
109
  - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
110
  - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
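
As a rough illustration of two of the options in this hunk, the sketch below expands a `--page_range` value such as `"0,5-10,20"` into page numbers and builds the separator that `--paginate_output` is described as emitting. This is not marker's own code: the helper names `parse_page_range` and `page_separator` are hypothetical, and the separator layout is only a literal reading of the option text (`\n\n{PAGE_NUMBER}`, then 48 dashes, then `\n\n`).

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec like "0,5-10,20" into a sorted list of page numbers.

    Hypothetical helper for illustration; not part of marker's API.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # "5-10" means pages 5 through 10, inclusive
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def page_separator(page_number: int) -> str:
    """Build the page break as the option text describes it (assumed layout)."""
    return f"\n\n{page_number}" + "-" * 48 + "\n\n"


if __name__ == "__main__":
    print(parse_page_range("0,5-10,20"))
    print(repr(page_separator(3)))
```

A separator of this shape is easy to split on afterwards, which is the usual reason to paginate output before chunking it by page.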
 
184
 
185
  ### Extract blocks
186
 
187
+ Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
188
 
189
  Here's an example of extracting all forms from a document:
190
 
 
224
 
225
  This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.
226
 
227
+ You can also run this via the CLI with
228
  ```shell
229
  marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
230
  ```
 
245
 
246
  This takes all the same configuration as the PdfConverter.
247
 
248
+ You can also run this via the CLI with
249
  ```shell
250
  marker_single FILENAME --converter_cls marker.converters.ocr.OCRConverter
251
  ```
 
262
 
263
  class Links(BaseModel):
264
      links: list[str]
265
+
266
  schema = Links.model_json_schema()
267
  config_parser = ConfigParser({
268
      "page_schema": schema
 
302
 
303
  JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.
304
 
305
+ The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
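
To make the tree description concrete, here is a minimal sketch that walks a list of page blocks and collects the leaf nodes. It assumes each block is a dict with a `block_type` key and an optional `children` list; those key names are not shown in this excerpt, and the sample data is invented purely for illustration.

```python
from typing import Any, Iterator

Block = dict[str, Any]


def iter_leaf_blocks(block: Block) -> Iterator[Block]:
    """Yield every block in the subtree that has no children.

    Assumes a `children` key holding a list (or None); this is an
    assumption for illustration, not a confirmed schema.
    """
    children = block.get("children") or []
    if not children:
        # No children: this block is a leaf (an empty page would also land here)
        yield block
        return
    for child in children:
        yield from iter_leaf_blocks(child)


# Hypothetical two-page output, made up for this example
pages = [
    {"block_type": "Page", "children": [
        {"block_type": "Text", "children": None},
        {"block_type": "ListGroup", "children": [
            {"block_type": "ListItem", "children": None},
        ]},
    ]},
    {"block_type": "Page", "children": [
        {"block_type": "Picture", "children": None},
    ]},
]

# The top-level output is a list of pages, so walk each page in turn
leaf_types = [leaf["block_type"] for page in pages for leaf in iter_leaf_blocks(page)]
print(leaf_types)
```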
306
 
307
  Pages have the keys:
308
 
 
368
  ], // computed PDF table of contents
369
  "page_stats": [
370
  {
371
+ "page_id": 0,
372
  "text_extraction_method": "pdftext",
373
  "block_counts": [("Span", 200), ...]
374
  },