Vik Paruchuri committed
Commit 25ef0bf · 1 parent: 4cbdef6
Add examples
Files changed:
- README.md +15 -14
- marker/schema/blocks/figure.py +1 -1
README.md
CHANGED
@@ -24,11 +24,11 @@ It only uses models where necessary, which improves speed and accuracy.

 ## Examples

-| PDF | File type | Markdown
-|
-| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython.md)
-| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
-| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn.md)
+| PDF | File type | Markdown | JSON |
+|-----|-----------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json) |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json) |

 ## Performance

@@ -60,7 +60,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):

 PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

-- Marker will
+- Marker will only convert block equations
 - Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
 - Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
@@ -181,6 +181,7 @@ Markdown output will include:
 - formatted tables
 - embedded LaTeX equations (fenced with `$$`)
 - Code is fenced with triple backticks
+- Superscripts for footnotes

 ## HTML

@@ -325,21 +326,21 @@ Benchmarking PDF extraction quality is hard. I've created a test set by finding

 **Speed**

-| Method
-|
-| marker
+| Method | Average Score | Time per page | Time per document |
+|---------|----------------|---------------|------------------|
+| marker | 0.625115 | 0.234184 | 21.545 |

 **Accuracy**

-| Method
-|
-| marker
+| Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
+|---------|----------------|-----------------|--------------|------------|-------------|----------------|
+| marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |

-Peak GPU memory usage during the benchmark is `
+Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.

 **Throughput**

-Marker takes about
+Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.

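The throughput line added to the README follows from a simple division, sketched below. The 6GB per-task figure is taken from the README text; the 48GB and 24GB capacities for the A6000 and A10 are assumptions based on the published card specifications, not values stated in this commit. The helper name `max_parallel_tasks` is hypothetical and is not part of marker.

```python
def max_parallel_tasks(total_vram_gb: float, vram_per_task_gb: float) -> int:
    """Return how many tasks fit in GPU memory at once, ignoring any headroom."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    # Floor division: a partially fitting task does not count.
    return int(total_vram_gb // vram_per_task_gb)


MARKER_VRAM_PER_TASK_GB = 6  # from the README: about 6GB of VRAM per task
A6000_VRAM_GB = 48           # assumption: RTX A6000 card specification
A10_VRAM_GB = 24             # assumption: A10 card specification

print(max_parallel_tasks(A6000_VRAM_GB, MARKER_VRAM_PER_TASK_GB))  # 8
print(max_parallel_tasks(A10_VRAM_GB, MARKER_VRAM_PER_TASK_GB))    # 4
```

Because the README reports 6GB as an average per task, a real deployment would probably leave some memory free rather than schedule the full theoretical count.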
marker/schema/blocks/figure.py
CHANGED
@@ -6,4 +6,4 @@ class Figure(Block):
     block_type: BlockTypes = BlockTypes.Figure

     def assemble_html(self, child_blocks, parent_structure):
-        return f"<p>Image {self.
+        return f"<p>Image {self.id}</p>"
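To see what the changed `assemble_html` line produces, here is a minimal, self-contained sketch. The `Block` and `BlockTypes` classes below are simplified stand-ins written for illustration only, not marker's real base classes, and the `id` value is a made-up example string. Only the `return` line mirrors the diff.

```python
from dataclasses import dataclass
from enum import Enum, auto


class BlockTypes(Enum):
    # Stand-in enum; the real one lives in marker and has more members.
    Figure = auto()


@dataclass
class Block:
    # Stand-in base class: assumes each block carries a printable id.
    id: str

    def assemble_html(self, child_blocks, parent_structure) -> str:
        raise NotImplementedError


@dataclass
class Figure(Block):
    block_type: BlockTypes = BlockTypes.Figure

    def assemble_html(self, child_blocks, parent_structure) -> str:
        # Same expression as the new line in the diff.
        return f"<p>Image {self.id}</p>"


figure = Figure(id="/page/0/Figure/1")  # hypothetical id format
print(figure.assemble_html(child_blocks=[], parent_structure=None))
# prints: <p>Image /page/0/Figure/1</p>
```

The figure renders as a short paragraph that names the block, so downstream output keeps a reference to the image even though no child blocks are assembled.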