rt4u/marker
Vik Paruchuri committed on Jan 15

Commit 1deee8c

1 parent: 1dfe667

Clean up table output

Files changed (4)

README.md CHANGED

@@ -394,12 +394,15 @@ Marker takes about 6GB of VRAM on average per task, so you can convert 8 documen
 ![Benchmark results](data/images/per_doc.png)
 ## Table Conversion
-Marker can extract tables from your PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a [tree edit distance] based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves an average score of `0.65` via this approach.
+Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
 |   Avg score |   Total tables |
 |-------------|----------------|
 |        0.65 |           1149 |
+We filter out tables that we cannot align with the ground truth, since fintabnet and our layout model have slightly different detection methods (this results in some tables being split/merged).
 ## Running your own benchmarks
 You can benchmark the performance of marker on your machine. Install marker manually with:

benchmarks/table/table.py CHANGED

@@ -152,6 +152,7 @@ def main(out_file: str, dataset: str, max_rows: int, max_workers: int, use_llm:
                 for th_tag in marker_table_soup.find_all('th'):
                     th_tag.name = 'td'
                 marker_table_html = str(marker_table_soup)
+                marker_table_html = marker_table_html.replace("\n", " ") # Fintabnet uses spaces instead of newlines
                 results.append({
                     "marker_table": marker_table_html,

marker/processors/table.py CHANGED

@@ -114,8 +114,8 @@ class TableProcessor(BaseProcessor):
     def finalize_cell_text(self, cell: SuryaTableCell):
         text = "\n".join([t["text"].strip() for t in cell.text_lines]) if cell.text_lines else ""
-        text = re.sub(r"(\s\.){3,}", "...", text)  # Replace . . .
-        text = re.sub(r"\.{3,}", "...", text)  # Replace ..., like in table of contents
+        text = re.sub(r"(\s\.){2,}", "", text)  # Replace . . .
+        text = re.sub(r"\.{2,}", "", text)  # Replace ..., like in table of contents
         return self.normalize_spaces(fix_text(text))
     @staticmethod
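
Reviewer note: the effect of the new patterns in `finalize_cell_text` can be checked in isolation. The sketch below applies only the two `re.sub` calls from the new side of the diff; the helper name `strip_dot_leaders` is hypothetical, and the `fix_text` and `normalize_spaces` steps are deliberately left out, so whitespace is not cleaned up here.

```python
import re


def strip_dot_leaders(text: str) -> str:
    """Apply the two substitutions from the new side of the diff, in order.

    Hypothetical standalone helper; the real method also calls fix_text()
    and normalize_spaces(), which are omitted from this sketch.
    """
    text = re.sub(r"(\s\.){2,}", "", text)  # spaced leaders such as " . . . ."
    text = re.sub(r"\.{2,}", "", text)      # unspaced leaders such as "....."
    return text


# Spaced leader: all four " ." repetitions are removed.
print(repr(strip_dot_leaders("Revenue . . . . 42")))    # 'Revenue 42'
# Unspaced leader: the run of dots is removed, the surrounding spaces stay.
print(repr(strip_dot_leaders("Introduction ..... 7")))  # 'Introduction  7'
# A single dot, as in a decimal number, is left alone.
print(repr(strip_dot_leaders("Total 3.5")))             # 'Total 3.5'
```

Because the threshold dropped from three dots to two, a literal two-dot sequence inside a cell is now removed as well, and a leader with no surrounding spaces joins its neighbours (for example, `"1.....7"` becomes `"17"`).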

marker/schema/blocks/tablecell.py CHANGED

@@ -13,5 +13,10 @@ class TableCell(Block):
     block_description: str = "A cell in a table."
     def assemble_html(self, document, child_blocks, parent_structure=None):
-        tag = "th" if self.is_header else "td"
-        return f"<{tag} rowspan={self.rowspan} colspan={self.colspan}>{self.text}</{tag}>"
+        tag_cls = "th" if self.is_header else "td"
+        tag = f"<{tag_cls}"
+        if self.rowspan > 1:
+            tag += f" rowspan={self.rowspan}"
+        if self.colspan > 1:
+            tag += f" colspan={self.colspan}"
+        return f"{tag}>{self.text}</{tag_cls}>"
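
Reviewer note: the new `assemble_html` logic emits `rowspan` and `colspan` only when they exceed 1, which keeps the output closer to plain `<td>` cells. A minimal sketch of the same branching as a free function follows; `cell_html` and its parameters are hypothetical stand-ins for the `TableCell` fields, not part of marker's API.

```python
def cell_html(text: str, is_header: bool = False,
              rowspan: int = 1, colspan: int = 1) -> str:
    """Mirror the branching in the new assemble_html (hypothetical helper)."""
    tag_cls = "th" if is_header else "td"
    tag = f"<{tag_cls}"
    # Span attributes are written only when they differ from the default of 1.
    if rowspan > 1:
        tag += f" rowspan={rowspan}"
    if colspan > 1:
        tag += f" colspan={colspan}"
    return f"{tag}>{text}</{tag_cls}>"


print(cell_html("Q1"))                               # <td>Q1</td>
print(cell_html("Total", is_header=True, colspan=2)) # <th colspan=2>Total</th>
print(cell_html("x", rowspan=2, colspan=3))          # <td rowspan=2 colspan=3>x</td>
```

With the old code the first call would have produced `<td rowspan=1 colspan=1>Q1</td>`, so every cell carried two redundant attributes.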