rt4u/marker

Vik Paruchuri committed on Nov 25, 2024

Commit 2b12856, 1 parent: 98f060b

Bugfixes

Files changed (5)

README.md +5 -6
marker/builders/ocr.py +1 -1
marker/processors/code.py +4 -0
marker/processors/footnote.py +34 -17
marker/schema/blocks/base.py +1 -1

README.md CHANGED

@@ -25,10 +25,9 @@ It only uses models where necessary, which improves speed and accuracy.
 ## Examples
 | PDF                                                                   | Markdown        | JSON                                                                                                 |
-| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook    | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md)     |
-| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf)             | Textbook    | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md)             |
-| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf)           | arXiv paper |  [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
-| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf)              | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md)        |
 ## Performance
@@ -106,7 +105,7 @@ Options:
 - `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
 - `--config_json PATH`: Path to a JSON configuration file containing additional settings.
 - `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "eng,fra,deu"` for English, French, and German.
-- `-l`: List all available builders, processors, and converters, and their associated configuration.  These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
 The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py).  If you don't need OCR, marker can work with any language.
@@ -117,7 +116,7 @@ marker /path/to/input/folder --workers 10
 ```
 - `marker` supports all the same options from `marker_single` above.
-- `--workers` is the number of conversion workers to run simultaneously.  This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage.  Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
 ## Convert multiple files on multiple GPUs

 ## Examples
 | PDF                                                                   | Markdown        | JSON                                                                                                 |
+| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook    | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython/thinkpython.md)     | |
+| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf)           | arXiv paper |  [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers/switch_transformers.md) |
+| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf)              | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers/multicolcnn.md)        |
 ## Performance
 - `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
 - `--config_json PATH`: Path to a JSON configuration file containing additional settings.
 - `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "eng,fra,deu"` for English, French, and German.
+- `config --help`: List all available builders, processors, and converters, and their associated configuration.  These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
 The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py).  If you don't need OCR, marker can work with any language.
 ```
 - `marker` supports all the same options from `marker_single` above.
+- `--workers` is the number of conversion workers to run simultaneously.  This is set to 5 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage.  Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
 ## Convert multiple files on multiple GPUs

marker/builders/ocr.py CHANGED

@@ -71,7 +71,7 @@ class OcrBuilder(BaseBuilder):
             det_processor=self.detection_model.processor,
             rec_model=self.recognition_model,
             rec_processor=self.recognition_model.processor,
-            batch_size=int(self.get_recognition_batch_size()),
             highres_images=[page.highres_image for page in page_list]
         )

             det_processor=self.detection_model.processor,
             rec_model=self.recognition_model,
             rec_processor=self.recognition_model.processor,
+            recognition_batch_size=int(self.get_recognition_batch_size()),
             highres_images=[page.highres_image for page in page_list]
         )

marker/processors/code.py CHANGED

@@ -19,6 +19,10 @@ class CodeProcessor(BaseProcessor):
         min_left = 9999  # will contain x- coord of column 0
         total_width = 0
         total_chars = 0
         for line_id in block.structure:
             line = document.get_block(line_id)
             min_left = min(line.polygon.bbox[0], min_left)

         min_left = 9999  # will contain x- coord of column 0
         total_width = 0
         total_chars = 0
+        if block.structure is None:
+            block.code = ""
+            return
         for line_id in block.structure:
             line = document.get_block(line_id)
             min_left = min(line.polygon.bbox[0], min_left)
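
The guard added above matters because iterating over `None` raises a `TypeError`. A minimal sketch of the same pattern, using a hypothetical `join_lines` helper and a plain dict in place of marker's `document.get_block` (both are assumptions for illustration, not part of the commit):

```python
def join_lines(structure, lookup):
    """Hypothetical stand-in for the loop in CodeProcessor.

    `structure` plays the role of block.structure (a list of line ids, or None)
    and `lookup` plays the role of document.get_block.
    """
    # Same early exit as the diff: an empty block yields empty code.
    if structure is None:
        return ""
    return "\n".join(lookup[line_id] for line_id in structure)


lines = {1: "x = 1", 2: "y = 2"}
print(join_lines([1, 2], lines))
print(repr(join_lines(None, lines)))
```

Without the early return, the `for` loop over a `None` structure would fail before any line is read.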

marker/processors/footnote.py CHANGED

@@ -15,45 +15,62 @@ from marker.schema.groups import PageGroup
 class FootnoteProcessor(BaseProcessor):
     """
     A processor for pushing footnotes to the bottom, and relabeling mislabeled text blocks.
     """
     block_types = (BlockTypes.Footnote,)
-    page_bottom_threshold = .66
-    font_size_scaler = .5
-    line_height_scaler = .5
     def __call__(self, document: Document):
         for page in document.pages:
-            self.relabel_texts_to_footnotes(page, document)
             self.push_footnotes_to_bottom(page, document)
-    def relabel_texts_to_footnotes(self, page: PageGroup, document: Document):
         text_blocks = page.contained_blocks(document, (BlockTypes.Text,))
         block_stats = []
         for block in text_blocks:
-            contained_spans = block.contained_blocks(document, (BlockTypes.Span,))
-            font_size = [span.font_size for span in contained_spans]
             contained_lines = block.contained_blocks(document, (BlockTypes.Line,))
             line_heights = [line.polygon.height for line in contained_lines]
             block_stats.append({
-                "font_size": mean(font_size),
-                "line_height": mean(line_heights),
-                "line_heights": line_heights,
-                "font_sizes": font_size,
-                "in_bottom_third": block.polygon.y_end > page.polygon.height * self.page_bottom_threshold
             })
         # Find the average font size and line height
-        avg_font_size = mean([fs for bs in block_stats for fs in bs["font_sizes"]])
-        avg_line_height = mean([lh for bs in block_stats for lh in bs["line_heights"]])
         for text_block, stats_dict in zip(text_blocks, block_stats):
             if all([
-                stats_dict["font_size"] < avg_font_size * self.font_size_scaler,
-                stats_dict["line_height"] < avg_line_height * self.line_height_scaler,
-                stats_dict["in_bottom_third"]
             ]):
                 new_block = Footnote.from_block(text_block)
                 page.replace_block(text_block, new_block)

 class FootnoteProcessor(BaseProcessor):
     """
     A processor for pushing footnotes to the bottom, and relabeling mislabeled text blocks.
+    Attributes:
+        page_bottom_threshold (float):
+            The fraction of page height that is considered the bottom.
+            Default is .75
+        line_height_scaler (float):
+            The amount to scale line height by to consider a block a footnote.
+            Default is .5
     """
     block_types = (BlockTypes.Footnote,)
+    page_bottom_threshold = .75
+    line_height_scaler = .85
     def __call__(self, document: Document):
+        footnote_heights = self.compute_block_stats(document)
+        if len(footnote_heights) == 0:
+            footnote_heights = [999]
+        avg_footnote_height = mean(footnote_heights)
         for page in document.pages:
+            self.relabel_texts_to_footnotes(page, document, avg_footnote_height)
             self.push_footnotes_to_bottom(page, document)
+    def compute_block_stats(self, document: Document):
+        line_heights = []
+        for page in document.pages:
+            for footnote in page.contained_blocks(document, self.block_types):
+                contained_lines = footnote.contained_blocks(document, (BlockTypes.Line,))
+                line_heights.extend([line.polygon.height for line in contained_lines])
+        return line_heights
+    def relabel_texts_to_footnotes(self, page: PageGroup, document: Document, avg_footnote_height: int):
         text_blocks = page.contained_blocks(document, (BlockTypes.Text,))
         block_stats = []
         for block in text_blocks:
             contained_lines = block.contained_blocks(document, (BlockTypes.Line,))
             line_heights = [line.polygon.height for line in contained_lines]
             block_stats.append({
+                "line_height": mean(line_heights) if len(line_heights) > 0 else 999,
+                "in_bottom": block.polygon.y_end > page.polygon.height * self.page_bottom_threshold
             })
         # Find the average font size and line height
+        if len(block_stats) == 0:
+            return
+        height_gap = 1 - self.line_height_scaler
         for text_block, stats_dict in zip(text_blocks, block_stats):
             if all([
+                avg_footnote_height * self.line_height_scaler < stats_dict["line_height"] < avg_footnote_height * (1 + height_gap),
+                stats_dict["in_bottom"]
             ]):
                 new_block = Footnote.from_block(text_block)
                 page.replace_block(text_block, new_block)
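
The new relabeling condition can be read as a band test: a text block is relabeled only if its mean line height falls within roughly 15% of the document-wide average footnote line height, and the block ends in the bottom quarter of the page. A minimal sketch of that condition, assuming plain numbers in place of marker's polygon objects (the `looks_like_footnote` helper and its parameter names are hypothetical):

```python
from statistics import mean

PAGE_BOTTOM_THRESHOLD = 0.75  # value of page_bottom_threshold in the diff
LINE_HEIGHT_SCALER = 0.85     # value of line_height_scaler in the diff


def looks_like_footnote(line_heights, block_y_end, page_height, avg_footnote_height):
    """Hypothetical helper mirroring the all([...]) condition in the diff."""
    # Blocks without lines get the same 999 sentinel the diff uses.
    line_height = mean(line_heights) if len(line_heights) > 0 else 999
    height_gap = 1 - LINE_HEIGHT_SCALER
    in_band = (
        avg_footnote_height * LINE_HEIGHT_SCALER
        < line_height
        < avg_footnote_height * (1 + height_gap)
    )
    in_bottom = block_y_end > page_height * PAGE_BOTTOM_THRESHOLD
    return in_band and in_bottom


# Average footnote line height of 8.0 gives a band of roughly 6.8 to 9.2.
print(looks_like_footnote([8.0, 8.2], 950, 1000, 8.0))  # similar height, near the bottom
print(looks_like_footnote([12.0], 950, 1000, 8.0))      # body-sized text, near the bottom
print(looks_like_footnote([8.0], 500, 1000, 8.0))       # similar height, mid-page
```

One consequence visible in this sketch: when the document has no footnotes, the average falls back to 999, and a text block with no lines also falls back to 999, so the two sentinels land in the same band.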

marker/schema/blocks/base.py CHANGED

@@ -77,7 +77,7 @@ class Block(BaseModel):
     @classmethod
     def from_block(cls, block: Block) -> Block:
-        block_attrs = block.model_dump(exclude=["id", "block_id"])
         return cls(**block_attrs)
     def structure_blocks(self, document_page: Document | PageGroup) -> List[Block]:

     @classmethod
     def from_block(cls, block: Block) -> Block:
+        block_attrs = block.model_dump(exclude=["id", "block_id", "block_type"])
         return cls(**block_attrs)
     def structure_blocks(self, document_page: Document | PageGroup) -> List[Block]:
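
A likely motivation for excluding `block_type` is that `Footnote.from_block(text_block)` would otherwise copy the source block's type into the new object. The sketch below illustrates this with standard-library dataclasses instead of pydantic, and assumes each block class declares its own `block_type` default; the class bodies here are simplified stand-ins, not marker's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class Block:
    text: str = ""
    block_type: str = "Block"

    @classmethod
    def from_block(cls, block, exclude=("block_type",)):
        # Copy every field except the excluded ones, so the target class
        # falls back to its own defaults for those fields.
        attrs = {k: v for k, v in asdict(block).items() if k not in exclude}
        return cls(**attrs)


@dataclass
class Text(Block):
    block_type: str = "Text"


@dataclass
class Footnote(Block):
    block_type: str = "Footnote"


source = Text(text="1. See the appendix.")
print(Footnote.from_block(source).block_type)              # type comes from Footnote
print(Footnote.from_block(source, exclude=()).block_type)  # type copied from the Text block
```

With the exclusion in place the converted block reports the type of its new class; without it, the object is a `Footnote` instance that still carries the `Text` type.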