Spaces: rt4u/marker

Vik Paruchuri committed on Jan 22

Commit f712a74

1 Parent(s): 21a0fa9

Fix table slicing issues

Files changed (6)

README.md +2 -2
marker/processors/llm/llm_form.py +1 -1
marker/processors/llm/llm_table.py +1 -1
marker/processors/llm/llm_table_merge.py +1 -1
marker/processors/table.py +22 -7
marker/scripts/streamlit_app.py +2 -3

README.md

@@ -400,8 +400,8 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverte
 | Avg score | Total tables | use_llm |
 |-----------|--------------|---------|
-| 0.824     | 54           | False   |
-| 0.873     | 54           | True    |
+| 0.82      | 54           | False   |
+| 0.887     | 54           | True    |
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.

marker/processors/llm/llm_form.py

@@ -17,7 +17,7 @@ Values and labels should appear in html tables, with the labels on the left side
 **Instructions:**
 1. Carefully examine the provided form block image.
 2. Analyze the html representation of the form.
-3. If the html representation is largely correct, then write "No corrections needed."
+3. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed."
 4. If the html representation contains errors, generate the corrected html representation.
 5. Output only either the corrected html representation or "No corrections needed."
 **Example:**

marker/processors/llm/llm_table.py

@@ -37,7 +37,7 @@ Some guidelines:
 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
-3. If the html representation is largely correct, then write "No corrections needed."
+3. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed."
 4. If the html representation contains errors, generate the corrected html representation.
 5. Output only either the corrected html representation or "No corrections needed."
 **Example:**

marker/processors/llm/llm_table_merge.py

@@ -55,7 +55,7 @@ You'll specify your judgement in json format - first whether Table 2 should be m
 Table 2 should be merged at the bottom of Table 1 if Table 2 has no headers, and the rows have similar values, meaning that Table 2 continues Table 1. Table 2 should be merged to the right of Table 1 if each row in Table 2 matches a row in Table 1, meaning that Table 2 contains additional columns that augment Table 1.
-Only merge Table 1 and Table 2 if Table 2 cannot be interpreted without merging.
+Only merge Table 1 and Table 2 if Table 2 cannot be interpreted without merging.  Only merge Table 1 and Table 2 if you can read both images properly.
 **Instructions:**
 1. Carefully examine the provided table images.  Table 1 is the first image, and Table 2 is the second image.

marker/processors/table.py

@@ -2,6 +2,8 @@ import re
 from collections import defaultdict
 from copy import deepcopy
 from typing import Annotated, List
+from collections import Counter
+from PIL import ImageDraw
 from ftfy import fix_text
 from surya.detection import DetectionPredictor
@@ -67,7 +69,7 @@ class TableProcessor(BaseProcessor):
         table_data = []
         for page in document.pages:
             for block in page.contained_blocks(document, self.block_types):
-                image = block.get_image(document, highres=True, expansion=(.01, .01))
+                image = block.get_image(document, highres=True)
                 image_poly = block.polygon.rescale((page.polygon.width, page.polygon.height), page.get_image(highres=True).size)
                 table_data.append({
@@ -165,22 +167,35 @@ class TableProcessor(BaseProcessor):
                 # Other cells that span into this row
                 rowspan_cells = [c for c in table.cells if c.row_id != row and c.row_id + c.rowspan > row > c.row_id]
-                should_split = all([
-                    len(row_cells) > 0,
+                should_split_entire_row = all([
+                    len(row_cells) > 1,
                     len(rowspan_cells) == 0,
                     all([r == 1 for r in rowspans]),
                     all([l > 1 for l in line_lens]),
                     all([l == line_lens[0] for l in line_lens])
                 ])
+                line_lens_counter = Counter(line_lens)
+                counter_keys = sorted(list(line_lens_counter.keys()))
+                should_split_partial_row = all([
+                    len(row_cells) > 3, # Only split if there are more than 3 cells
+                    len(rowspan_cells) == 0,
+                    all([r == 1 for r in rowspans]),
+                    len(line_lens_counter) == 2 and counter_keys[0] <= 1 and counter_keys[1] > 1 and line_lens_counter[counter_keys[0]] == 1, # Allow a single column with a single line - keys are the line lens, values are the counts
+                ])
+                should_split = should_split_entire_row or should_split_partial_row
                 if should_split:
-                    for i in range(0, line_lens[0]):
+                    for i in range(0, max(line_lens)):
                         for cell in row_cells:
-                            line = cell.text_lines[i]
+                            # Calculate height based on number of splits
+                            split_height = cell.bbox[3] - cell.bbox[1]
+                            current_bbox = [cell.bbox[0], cell.bbox[1] + i * split_height, cell.bbox[2], cell.bbox[1] + (i + 1) * split_height]
+                            line = [cell.text_lines[i]] if cell.text_lines and i < len(cell.text_lines) else None
                             cell_id = max_cell_id + new_cell_count
                             new_cells.append(
                                 SuryaTableCell(
-                                    polygon=line["bbox"],
-                                    text_lines=[line],
+                                    polygon=current_bbox,
+                                    text_lines=line,
                                     rowspan=1,
                                     colspan=cell.colspan,
                                     row_id=cell.row_id + shift_up + i,
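
Reviewer note: the new split condition in this hunk is easier to check in isolation. The sketch below restates the two predicates from the diff on plain lists. The function name `should_split_row` and its parameters are hypothetical, and it assumes `line_lens` and `rowspans` each hold one entry per cell in the row (so `len(line_lens)` stands in for `len(row_cells)`), which the hunk does not show.

```python
from collections import Counter
from typing import List


def should_split_row(line_lens: List[int], rowspans: List[int], n_rowspan_cells: int) -> bool:
    """Restate the diff's split predicates on plain lists (hypothetical helper).

    line_lens: number of text lines in each cell of the row (assumed one per cell).
    rowspans: rowspan of each cell in the row.
    n_rowspan_cells: number of cells from other rows that span into this row.
    """
    n_cells = len(line_lens)
    no_spans = n_rowspan_cells == 0 and all(r == 1 for r in rowspans)

    # Entire row: more than one cell, and every cell has the same number (> 1) of lines.
    entire_row = (
        n_cells > 1
        and no_spans
        and all(n > 1 for n in line_lens)
        and all(n == line_lens[0] for n in line_lens)
    )

    # Partial row: more than three cells and exactly two distinct line counts,
    # where the smaller count is <= 1 and occurs in exactly one cell.
    counts = Counter(line_lens)
    keys = sorted(counts)
    partial_row = (
        n_cells > 3
        and no_spans
        and len(counts) == 2
        and keys[0] <= 1
        and keys[1] > 1
        and counts[keys[0]] == 1
    )
    return entire_row or partial_row
```

Under these assumptions, a row with line counts `[2, 2, 2, 1]` qualifies through the partial-row branch, while `[2, 2, 1, 1]` does not, because two cells carry the short count.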

marker/scripts/streamlit_app.py

@@ -1,11 +1,10 @@
 import os
+os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
+os.environ["IN_STREAMLIT"] = "true"
 from marker.settings import settings
 from streamlit.runtime.uploaded_file_manager import UploadedFile
-os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
-os.environ["IN_STREAMLIT"] = "true"
 import base64
 import io
 import re
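
Reviewer note: this hunk only moves the two `os.environ` assignments above the `marker.settings` import. The diff does not say why, but a likely reason is that code reached through that import reads the environment once, at import time. The sketch below illustrates that general effect with a hypothetical stand-in module (`fake_settings`, simulated with `exec`). It is not marker's actual settings code.

```python
import os
import types

# Hypothetical stand-in for a module that reads the environment once, at import time.
FAKE_SETTINGS_SOURCE = '''
import os
IN_STREAMLIT = os.environ.get("IN_STREAMLIT", "false") == "true"
'''


def simulate_import() -> types.ModuleType:
    """Execute the module body once, as a first import would."""
    module = types.ModuleType("fake_settings")
    exec(FAKE_SETTINGS_SOURCE, module.__dict__)
    return module


def flag_seen(set_before_import: bool) -> bool:
    """Return the flag value the module captured for the given ordering."""
    os.environ.pop("IN_STREAMLIT", None)  # start from a clean environment
    try:
        if set_before_import:
            os.environ["IN_STREAMLIT"] = "true"
        settings = simulate_import()
        if not set_before_import:
            os.environ["IN_STREAMLIT"] = "true"  # too late: the value was already read
        return settings.IN_STREAMLIT
    finally:
        os.environ.pop("IN_STREAMLIT", None)  # leave the environment as we found it
```

With this stand-in, `flag_seen(True)` captures the flag and `flag_seen(False)` does not, which matches the ordering the commit moves to.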