u-ashish
committed on
Commit
·
cf5101a
1
Parent(s):
0437a20
Update README
README.md
CHANGED
@@ -11,6 +11,8 @@ Marker converts documents to markdown, JSON, chunks, and HTML quickly and accura
 - Optionally boost accuracy with LLMs (and your own prompt)
 - Works on GPU, CPU, or MPS
 
+For our managed API or on-prem document intelligence solution, check out [our platform here](https://datalab.to?utm_source=gh-marker).
+
 ## Performance
 
 <img src="data/images/overall.png" width="800px"/>
@@ -41,11 +43,11 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al
 
 # Commercial usage
 
-Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to).
+Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page [here](https://www.datalab.to?utm_source=gh-marker).
 
 # Hosted API
 
-There's a hosted API for marker available [here](https://www.datalab.to
+There's a hosted API for marker available [here](https://www.datalab.to?utm_source=gh-marker):
 
 - Supports PDF, image, PPT, PPTX, DOC, DOCX, XLS, XLSX, HTML, EPUB files
 - 1/4th the price of leading cloud-based competitors
@@ -102,7 +104,7 @@ Options:
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
 - `--output_format [markdown|json|html|chunks]`: Specify the format for the output results.
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
-- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy. You will need to configure the LLM backend - see [below](#llm-services).
 - `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text. This will also format inline math properly.
 - `--block_correction_prompt`: if LLM mode is active, an optional prompt that will be used to correct the output of marker. This is useful for custom formatting or logic that you want to apply to the output.
@@ -182,7 +184,7 @@ rendered = converter("FILEPATH")
 
 ### Extract blocks
 
-Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
+Each document consists of one or more pages. Pages contain blocks, which can themselves contain other blocks. It's possible to programmatically manipulate these blocks.
 
 Here's an example of extracting all forms from a document:
 
@@ -222,7 +224,7 @@ text, _, images = text_from_rendered(rendered)
 
 This takes all the same configuration as the PdfConverter. You can specify the configuration `force_layout_block=Table` to avoid layout detection and instead assume every page is a table. Set `output_format=json` to also get cell bounding boxes.
 
-You can also run this via the CLI with
+You can also run this via the CLI with
 ```shell
 marker_single FILENAME --use_llm --force_layout_block Table --converter_cls marker.converters.table.TableConverter --output_format json
 ```
@@ -243,7 +245,7 @@ rendered = converter("FILEPATH")
 
 This takes all the same configuration as the PdfConverter.
 
-You can also run this via the CLI with
+You can also run this via the CLI with
 ```shell
 marker_single FILENAME --converter_cls marker.converters.ocr.OCRConverter
 ```
@@ -260,7 +262,7 @@ from pydantic import BaseModel
 
 class Links(BaseModel):
     links: list[str]
-
+
 schema = Links.model_json_schema()
 config_parser = ConfigParser({
     "page_schema": schema
@@ -300,7 +302,7 @@ HTML output is similar to markdown output:
 
 JSON output will be organized in a tree-like structure, with the leaf nodes being blocks. Examples of leaf nodes are a single list item, a paragraph of text, or an image.
 
-The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
+The output will be a list, with each list item representing a page. Each page is considered a block in the internal marker schema. There are different types of blocks to represent different elements.
 
 Pages have the keys:
 
@@ -366,7 +368,7 @@ All output formats will return a metadata dictionary, with the following fields:
 ], // computed PDF table of contents
 "page_stats": [
 {
-"page_id": 0,
+"page_id": 0,
 "text_extraction_method": "pdftext",
 "block_counts": [("Span", 200), ...]
 },
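
Note on the `--page_range` syntax described in the README text above: the format (comma-separated page numbers and inclusive ranges) can be illustrated with a short sketch. This is not marker's own parser. The function name `parse_page_range` is hypothetical, and the handling of whitespace, duplicates, and reversed ranges is an assumption made for illustration only.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices.

    Hypothetical helper: it mirrors the README's description of the
    --page_range format, not marker's actual implementation.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas (an assumption)
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Ranges are inclusive: "5-10" covers pages 5 through 10.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```

The printed result matches the README's own example: pages 0, 5 through 10, and page 20.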