rt4u/marker

Vik Paruchuri committed on Jan 7

Commit

3453dd8

1 Parent(s): 818619c

Small bugfix

Files changed (4)

README.md CHANGED

@@ -104,6 +104,7 @@ marker_single /path/to/file.pdf
 Options:
 - `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
 - `--output_format [markdown|json|html]`: Specify the format for the output results.
+- `--paginate_output`: Paginates the output, using `\n\n{PAGE_NUMBER}` followed by `-` * 48, then `\n\n`
 - `--use_llm`: Uses an LLM to improve accuracy.  You must set your Gemini API key using the `GOOGLE_API_KEY` env var.
 - `--disable_image_extraction`: Don't extract images from the PDF.  If you also specify `--use_llm`, then images will be replaced with a description.
 - `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
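
The separator described for `--paginate_output` can be sketched as follows. This is only an illustration of the README wording, not marker's own code: the helper names `page_separator` and `paginate` are hypothetical, and the README does not say whether the braces around `PAGE_NUMBER` are emitted literally, so the sketch exposes that as an assumed `keep_braces` switch.

```python
def page_separator(page_number: int, keep_braces: bool = False) -> str:
    """Build the separator the README describes (hypothetical helper).

    Assumption: `{PAGE_NUMBER}` is a placeholder for the page index.
    Set keep_braces=True if the real output wraps the number in braces.
    """
    label = f"{{{page_number}}}" if keep_braces else str(page_number)
    # Two newlines, the page label, 48 dashes, then two newlines.
    return f"\n\n{label}" + "-" * 48 + "\n\n"


def paginate(pages: list[str]) -> str:
    """Join page texts, prefixing each with its separator."""
    return "".join(page_separator(i) + text for i, text in enumerate(pages))


if __name__ == "__main__":
    print(paginate(["First page text.", "Second page text."]))
```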

marker/processors/llm/llm_table_merge.py ADDED

File without changes

marker/processors/llm/utils.py CHANGED

@@ -35,7 +35,7 @@ class GoogleModel:
         while tries < max_retries:
             try:
                 responses = self.model.generate_content(
-                    [prompt, image],
+                    [image, prompt], # According to gemini docs, it performs better if the image is the first element
                     stream=False,
                     generation_config={
                         "temperature": 0,

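The retry loop in this hunk, with the image placed before the prompt, might look roughly like the sketch below. Only the `while tries < max_retries` loop, the `[image, prompt]` ordering, `stream=False`, and `"temperature": 0` come from the diff. The function name `generate_with_retries`, the `delay` parameter, the broad `except`, and the `FlakyModel` stand-in are assumptions made so the example runs without the Gemini client.

```python
import time


def generate_with_retries(model, prompt, image, max_retries=3, delay=0.0):
    """Call model.generate_content with the image first, retrying on errors.

    Hypothetical helper: the real method's error handling is not shown
    in the diff, so catching Exception here is an assumption.
    """
    tries = 0
    while tries < max_retries:
        try:
            return model.generate_content(
                [image, prompt],  # image first, as the commit comment suggests
                stream=False,
                generation_config={"temperature": 0},
            )
        except Exception:
            tries += 1
            time.sleep(delay)
    return None  # all attempts failed


class FlakyModel:
    """Stand-in for a model client; fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = []

    def generate_content(self, contents, stream=False, generation_config=None):
        self.calls.append(contents)
        if len(self.calls) <= self.failures:
            raise RuntimeError("transient error")
        return "ok"


if __name__ == "__main__":
    model = FlakyModel(failures=2)
    print(generate_with_retries(model, "Describe the table.", "<image>"))
```
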
marker/processors/sectionheader.py CHANGED

@@ -54,11 +54,8 @@ class SectionHeaderProcessor(BaseProcessor):
         heading_ranges = self.bucket_headings(flat_line_heights)
         for page in document.pages:
-            for block in page.children:
-                if block.block_type not in self.block_types:
-                    continue
-                block_height = line_heights[block.id]
+            for block in page.contained_blocks(document, self.block_types):
+                block_height = line_heights.get(block.id, 0)
                 if block_height > 0:
                     for idx, (min_height, max_height) in enumerate(heading_ranges):
                         if block_height >= min_height * self.height_tolerance:
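
One effect of switching from `line_heights[block.id]` to `line_heights.get(block.id, 0)` can be shown in isolation: a block with no recorded height no longer raises `KeyError` and is instead skipped by the existing `block_height > 0` check. The sketch below uses plain strings as hypothetical block ids and does not reproduce marker's block classes or `contained_blocks`.

```python
def blocks_with_height(block_ids, line_heights):
    """Return (block_id, height) pairs for blocks that have a positive height.

    Mirrors the new lookup: missing ids default to 0 and are filtered out.
    """
    kept = []
    for block_id in block_ids:
        block_height = line_heights.get(block_id, 0)
        if block_height > 0:
            kept.append((block_id, block_height))
    return kept


# Hypothetical data: the second block has no entry in line_heights.
line_heights = {"header-0": 14.0}
block_ids = ["header-0", "header-1"]

if __name__ == "__main__":
    print(blocks_with_height(block_ids, line_heights))
    try:
        line_heights["header-1"]  # the old-style lookup
    except KeyError as err:
        print("old lookup raises KeyError for", err)
```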