Spaces: rt4u/marker


Vik Paruchuri committed on Nov 26, 2024

Commit 25ef0bf (1 parent: 4cbdef6)

Add examples

Files changed (2)

README.md +15 -14
marker/schema/blocks/figure.py +1 -1

README.md

@@ -24,11 +24,11 @@ It only uses models where necessary, which improves speed and accuracy.
 ## Examples
-| PDF | File type | Markdown | JSON |
-|-----|-----------|----------|------|
-| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json) |
-| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
-| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.md) |
 ## Performance
@@ -60,7 +60,7 @@ There's a hosted API for marker available [here](https://www.datalab.to/):
 PDF is a tricky format, so marker will not always work perfectly.  Here are some known limitations that are on the roadmap to address:
-- Marker will not convert inline equations
 - Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
 - Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
@@ -181,6 +181,7 @@ Markdown output will include:
 - formatted tables
 - embedded LaTeX equations (fenced with `$$`)
 - Code is fenced with triple backticks
 ## HTML
@@ -325,21 +326,21 @@ Benchmarking PDF extraction quality is hard.  I've created a test set by finding
 **Speed**
-| Method | Average Score | Time per page | Time per document |
-|--------|---------------|---------------|-------------------|
-| marker | 0.618355      | 0.250211      | 23.0194           |
 **Accuracy**
-| Method | multicolcnn.pdf | switch_trans.pdf | thinkpython.pdf | thinkos.pdf | thinkdsp.pdf | crowd.pdf |
-|--------|-----------------|------------------|-----------------|-------------|--------------|-----------|
-| marker | 0.536176        | 0.516833         | 0.70515         | 0.710657    | 0.690042     | 0.523467  |
-Peak GPU memory usage during the benchmark is `4.1GB` for marker.  Benchmarks were run on an A10.
 **Throughput**
-Marker takes about 4GB of VRAM on average per task, so you can convert 12 documents in parallel on an A6000.
 ![Benchmark results](data/images/per_doc.png)

 ## Examples
+| PDF | File type | Markdown                                                                                                                     | JSON                                                                                                   |
+|-----|-----------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/thinkpython/thinkpython.md)                 | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/thinkpython.json)         |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/switch_transformers/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/switch_transformers.json) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/markdown/multicolcnn/multicolcnn.md)                 | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/json/multicolcnn.json)         |
 ## Performance
 PDF is a tricky format, so marker will not always work perfectly.  Here are some known limitations that are on the roadmap to address:
+- Marker will only convert block equations
 - Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
 - Forms are not converted optimally
 - Very complex layouts, with nested tables and forms, may not work
 - formatted tables
 - embedded LaTeX equations (fenced with `$$`)
 - Code is fenced with triple backticks
+- Superscripts for footnotes
 ## HTML
 **Speed**
+| Method  | Average Score | Time per page | Time per document |
+|---------|----------------|---------------|------------------|
+| marker  | 0.625115       | 0.234184     | 21.545           |
 **Accuracy**
+| Method  | thinkpython.pdf | switch_trans.pdf | thinkdsp.pdf | crowd.pdf | thinkos.pdf | multicolcnn.pdf |
+|---------|----------------|-----------------|--------------|------------|-------------|----------------|
+| marker  | 0.720347       | 0.592002       | 0.70468     | 0.515082   | 0.701394    | 0.517184      |
+Peak GPU memory usage during the benchmark is `6GB` for marker.  Benchmarks were run on an A10.
 **Throughput**
+Marker takes about 6GB of VRAM on average per task, so you can convert 8 documents in parallel on an A6000.
 ![Benchmark results](data/images/per_doc.png)
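
The throughput figures in both versions of the README follow from simple division. A minimal sketch of that arithmetic, assuming the A6000's 48 GB of VRAM (a figure the diff does not state) and ignoring any fixed per-process overhead; `max_parallel_tasks` is a hypothetical helper, not part of marker:

```python
def max_parallel_tasks(total_vram_gb: float, vram_per_task_gb: float) -> int:
    """Return how many tasks fit in VRAM at once, rounded down."""
    if vram_per_task_gb <= 0:
        raise ValueError("vram_per_task_gb must be positive")
    return int(total_vram_gb // vram_per_task_gb)


A6000_VRAM_GB = 48  # assumption: not stated in the README diff

# Updated README: about 6GB per task
print(max_parallel_tasks(A6000_VRAM_GB, 6))  # 8
# Previous README: about 4GB per task
print(max_parallel_tasks(A6000_VRAM_GB, 4))  # 12
```

Real capacity may be lower if peak usage exceeds the average on some documents.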

marker/schema/blocks/figure.py

@@ -6,4 +6,4 @@ class Figure(Block):
     block_type: BlockTypes = BlockTypes.Figure
     def assemble_html(self, child_blocks, parent_structure):
-        return f"<p>Image {self.block_id}</p>"

     block_type: BlockTypes = BlockTypes.Figure
     def assemble_html(self, child_blocks, parent_structure):
+        return f"<p>Image {self.id}</p>"
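
To illustrate what this one-line change could affect, here is a standalone sketch. The `Block` class below is a hypothetical stand-in, not marker's real class: it assumes `block_id` is a bare per-page index, while `id` is a fuller identifier built from the page, block type, and index. The path-like format of `id` is also an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical stand-in for marker's Block; field names are assumptions.
    page_id: int
    block_id: int
    block_type: str = "Block"

    @property
    def id(self) -> str:
        # Assumed format: a path-like identifier that is unique per document.
        return f"/page/{self.page_id}/{self.block_type}/{self.block_id}"


@dataclass
class Figure(Block):
    block_type: str = "Figure"

    def assemble_html(self, child_blocks, parent_structure):
        # Mirrors the changed line: use the full id, not the bare index.
        return f"<p>Image {self.id}</p>"


fig = Figure(page_id=0, block_id=3)
print(fig.assemble_html([], None))  # <p>Image /page/0/Figure/3</p>
```

Under these assumptions, the bare `block_id` would repeat across pages, while the full `id` stays distinct for each figure in the document.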