Vik Paruchuri committed on
Commit 25ef0bf · 1 Parent(s): 4cbdef6

Add examples

Files changed (2)
  1. README.md +15 -14
  2. marker/schema/blocks/figure.py +1 -1
README.md CHANGED
@@ -24,11 +24,11 @@ It only uses models where necessary, which improves speed and accuracy.
 
  ## Examples
 
- | PDF | File type | Markdown | JSON |
- |-----|-----------|----------|------|
- | [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json) |
- | [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
- | [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.md) |
+ | PDF | File type | Markdown | JSON |
+ |-----|-----------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
+ | [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json) |
+ | [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
+ | [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json) |
 
  ## Performance
 
@@ -60,7 +60,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 
  PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:
 
- - Marker will not convert inline equations
+ - Marker will only convert block equations
  - Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
  - Forms are not converted optimally
  - Very complex layouts, with nested tables and forms, may not work
@@ -181,6 +181,7 @@ Markdown output will include:
  - formatted tables
  - embedded LaTeX equations (fenced with `$$`)
  - Code is fenced with triple backticks
+ - Superscripts for footnotes
 
  ## HTML
 
@@ -325,21 +326,21 @@ Benchmarking PDF extraction quality is hard. I've created a test set by finding
 
  **Speed**
 
- | Method | Average Score | Time per page | Time per document |
- |--------|---------------|---------------|-------------------|
- | marker | 0.618355 | 0.250211 | 23.0194 |
+ | Method | Average Score | Time per page | Time per document |
+ |---------|----------------|---------------|------------------|
+ | marker | 0.625115 | 0.234184 | 21.545 |
 
  **Accuracy**
 
- | Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
- |--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
- | marker | 0.536176 | 0.516833 | 0.70515 | 0.710657 | 0.690042 | 0.523467 |
+ | Method | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
+ |---------|----------------|-----------------|--------------|------------|-------------|----------------|
+ | marker | 0.720347 | 0.592002 | 0.70468 | 0.515082 | 0.701394 | 0.517184 |
 
- Peak GPU memory usage during the benchmark is `4.1GB` for marker. Benchmarks were run on an A10.
+ Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
 
  **Throughput**
 
- Marker takes about 4GB of VRAM on average per task, so you can convert 12 documents in parallel on an A6000.
+ Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
 
  ![Benchmark results](data/images/per_doc.png)
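Note on the throughput change in README.md: the drop from 12 to 8 parallel documents appears to follow from dividing the card's memory by the per-task footprint. Below is a minimal sketch of that arithmetic, assuming the A6000 in question is the 48 GB model and that nothing else occupies VRAM. The helper name `parallel_tasks` is illustrative and not part of marker.

```python
def parallel_tasks(total_vram_gb: float, vram_per_task_gb: float) -> int:
    """Return how many tasks fit in VRAM at once, ignoring any other overhead."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    return int(total_vram_gb // vram_per_task_gb)


# Assumption: an RTX A6000 with 48 GB of VRAM.
A6000_VRAM_GB = 48

# Per-task figures taken from the README before and after this commit.
old_estimate = parallel_tasks(A6000_VRAM_GB, 4)
new_estimate = parallel_tasks(A6000_VRAM_GB, 6)

print(old_estimate, new_estimate)  # prints "12 8"
```

Real capacity will likely be somewhat lower, since the per-task number is an average and peak usage can exceed it.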
 
marker/schema/blocks/figure.py CHANGED
@@ -6,4 +6,4 @@ class Figure(Block):
      block_type: BlockTypes = BlockTypes.Figure
 
      def assemble_html(self, child_blocks, parent_structure):
-         return f"<p>Image {self.block_id}</p>"
+         return f"<p>Image {self.id}</p>"