Commit fd26a53 · 1 Parent(s): fea4168
Vik Paruchuri committed

README updates
README.md CHANGED

````diff
@@ -32,9 +32,9 @@ It only uses models where necessary, which improves speed and accuracy.
 
 ## Performance
 
-![Benchmark overall](data/images/overall.png)
+<img src="data/images/overall.png" width="800px"/>
 
-Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix.
+Marker benchmarks favorably compared to cloud services like Llamaparse and Mathpix, as well as other open source tools.
 
 The above results are running single PDF pages serially. Marker is significantly faster when running in batch mode, with a projected throughput of 122 pages/second on an H100 (.18 seconds per page across 22 processes).
 
@@ -381,16 +381,33 @@ Pass the `debug` option to activate debug mode. This will save images of each p
 # Benchmarks
 
 ## Overall PDF Conversion
-We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl.
-
-| Method | Avg Time | Heuristic Score | LLM Score |
-|------------|------------|-----------------|-----------|
-| marker | 2.83837 | 95.6709 | 4.23916 |
-| llamaparse | 23.348 | 84.2442 | 3.97619 |
-| mathpix | 6.36223 | 86.4281 | 4.15626 |
-| docling | 3.86 | 87.7347 | 3.72222 |
-
-Peak GPU memory usage during the benchmark is `6GB` for marker. Benchmarks were run on an A10.
+
+We created a [benchmark set](https://huggingface.co/datasets/datalab-to/marker_benchmark) by extracting single PDF pages from common crawl. We scored based on a heuristic that aligns text with ground truth text segments, and an LLM as a judge scoring method.
+
+| Method | Avg Time | Heuristic Score | LLM Score |
+|------------|----------|-----------------|-----------|
+| marker | 2.83837 | 95.6709 | 4.23916 |
+| llamaparse | 23.348 | 84.2442 | 3.97619 |
+| mathpix | 6.36223 | 86.4281 | 4.15626 |
+| docling | 3.69949 | 86.7073 | 3.70429 |
+
+Benchmarks were run on an H100 for marker and docling - llamaparse and mathpix used their cloud services. We can also look at it by document type:
+
+<img src="data/images/per_doc.png" width="1000px"/>
+
+| Document Type | Marker heuristic | Marker LLM | Llamaparse Heuristic | Llamaparse LLM | Mathpix Heuristic | Mathpix LLM | Docling Heuristic | Docling LLM |
+|----------------------|------------------|------------|----------------------|----------------|-------------------|-------------|-------------------|-------------|
+| Scientific paper | 96.6737 | 4.34899 | 87.1651 | 3.96421 | 91.2267 | 4.46861 | 92.135 | 3.72422 |
+| Book page | 97.1846 | 4.16168 | 90.9532 | 4.07186 | 93.8886 | 4.35329 | 90.0556 | 3.64671 |
+| Other | 95.1632 | 4.25076 | 81.1385 | 4.01835 | 79.6231 | 4.00306 | 83.8223 | 3.76147 |
+| Form | 88.0147 | 3.84663 | 66.3081 | 3.68712 | 64.7512 | 3.33129 | 68.3857 | 3.40491 |
+| Presentation | 95.1562 | 4.13669 | 81.2261 | 4 | 83.6737 | 3.95683 | 84.8405 | 3.86331 |
+| Financial document | 95.3697 | 4.39106 | 82.5812 | 4.16111 | 81.3115 | 4.05556 | 86.3882 | 3.8 |
+| Letter | 98.4021 | 4.5 | 93.4477 | 4.28125 | 96.0383 | 4.45312 | 92.0952 | 4.09375 |
+| Engineering document | 93.9244 | 4.04412 | 77.4854 | 3.72059 | 80.3319 | 3.88235 | 79.6807 | 3.42647 |
+| Legal document | 96.689 | 4.27759 | 86.9769 | 3.87584 | 91.601 | 4.20805 | 87.8383 | 3.65552 |
+| Newspaper page | 98.8733 | 4.25806 | 84.7492 | 3.90323 | 96.9963 | 4.45161 | 92.6496 | 3.51613 |
+| Magazine page | 98.2145 | 4.38776 | 87.2902 | 3.97959 | 93.5934 | 4.16327 | 93.0892 | 4.02041 |
 
 ## Throughput
 
@@ -400,7 +417,7 @@ We benchmarked throughput using a [single long PDF](https://www.greenteapress.co
 |---------|---------------|-------------------|---------- |
 | marker | 0.18 | 43.42 | 3.17GB |
 
-The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes.
+The projected throughput is 122 pages per second on an H100 - we can run 22 individual processes given the VRAM used.
 
 ## Table Conversion
 
@@ -408,9 +425,9 @@ Marker can extract tables from PDFs using `marker.converters.table.TableConverte
 
 | Method | Avg score | Total tables |
 |------------------|-----------|--------------|
-| marker | 0.822 | 54 |
+| marker | 0.816 | 99 |
 | marker w/use_llm | 0.887 | 54 |
-| gemini | | 54 |
+| gemini | 0.829 | 99 |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
 
@@ -438,15 +455,20 @@ Options:
 - `--use_llm` use an llm to improve the marker results.
 - `--max_rows` how many rows to process for the benchmark.
 - `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
-- `--scores` which scoring functions to use, can be `--llm`, `--heuristic`.
+- `--scores` which scoring functions to use, can be `llm`, `heuristic`. Comma separated.
 
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
 
 ```shell
-python benchmarks/table/table.py --max_rows 1000
+python benchmarks/table/table.py --max_rows 100
 ```
 
+Options:
+
+- `--use_llm` uses an llm with marker to improve accuracy.
+- `--use_gemini` also benchmarks gemini 2.0 flash.
+
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -456,4 +478,4 @@ This work would not have been possible without amazing open source models and da
 - Pypdfium2/pdfium
 - DocLayNet from IBM
 
-Thank you to the authors of these models and datasets for making them available to the community!
+Thank you to the authors of these models and datasets for making them available to the community!
````
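The throughput projection in the README change above is simple arithmetic on the measured numbers; a quick sanity check in Python, with the constants taken from the README's own tables (0.18 s per page, 3.17 GB VRAM per process, 80 GB on an H100):

```python
seconds_per_page = 0.18     # measured per-page latency from the throughput table
vram_per_process_gb = 3.17  # peak VRAM per marker process, same table
processes = 22              # process count used in the README projection

# 22 processes fit comfortably in one H100's 80 GB of VRAM
assert processes * vram_per_process_gb < 80

# Projected batch throughput across all processes
print(round(processes / seconds_per_page))  # 122
```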
benchmarks/overall/elo.py CHANGED

```diff
@@ -106,7 +106,11 @@ class Comparer:
         version_b: str
     ) -> str | None:
         hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
-        rating = self.llm_rater(img, hydrated_prompt)
+        try:
+            rating = self.llm_rater(img, hydrated_prompt)
+        except Exception as e:
+            print(f"Error: {e}")
+            return
         return rating
 
 
@@ -142,6 +146,9 @@ class Comparer:
         except APIError as e:
             print(f"Hit Gemini rate limit")
             return
+        except Exception as e:
+            print(f"Error: {e}")
+            return
 
 @dataclass
 class Method:
@@ -189,22 +196,24 @@ def main(
 
     for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
         row = ds[i]
-        for j in range(row_samples):
-            method_a = random.choice(method_lst)
-            method_b = random.choice(method_lst)
-            if method_a == method_b:
-                continue
-
-            method_a_md = row[f"{method_a}_md"]
-            method_b_md = row[f"{method_b}_md"]
-            winner = comparer(row["img"], method_a_md, method_b_md)
-            if not winner:
-                continue
-
-            if winner == "version_a":
-                elo.update_ratings(method_a, method_b)
-            else:
-                elo.update_ratings(method_b, method_a)
+        # Avoid any bias in ordering
+        random.shuffle(method_lst)
+
+        for j, method_a in enumerate(method_lst[:-1]):
+            for z, method_b in enumerate(method_lst[j:]):
+                if method_a == method_b:
+                    continue
+
+                method_a_md = row[f"{method_a}_md"]
+                method_b_md = row[f"{method_b}_md"]
+                winner = comparer(row["img"], method_a_md, method_b_md)
+                if not winner:
+                    continue
+
+                if winner == "version_a":
+                    elo.update_ratings(method_a, method_b)
+                else:
+                    elo.update_ratings(method_b, method_a)
         if i % 10 == 0:
             print(elo.methods)
```
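The `elo.update_ratings(winner, loser)` call sits outside these hunks. For readers unfamiliar with Elo, here is a minimal sketch of the standard update such a method typically performs; the `K` factor, starting ratings, and function shape are hypothetical, not the repo's implementation:

```python
K = 32  # assumed K-factor; the repo may use a different value

def expected(r_a: float, r_b: float) -> float:
    # Probability that a player rated r_a beats one rated r_b
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_ratings(ratings: dict, winner: str, loser: str) -> None:
    # Winner gains, loser loses, proportionally to how surprising the result was
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = {"marker": 1000.0, "docling": 1000.0}
update_ratings(ratings, "marker", "docling")
print(ratings["marker"])  # 1016.0 (equal ratings, so expected score was 0.5)
```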
benchmarks/table/inference.py CHANGED

```diff
@@ -160,8 +160,10 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
                 tbody.unwrap()
             for th_tag in marker_table_soup.find_all('th'):
                 th_tag.name = 'td'
+            for br_tag in marker_table_soup.find_all('br'):
+                br_tag.replace_with(marker_table_soup.new_string(''))
+
             marker_table_html = str(marker_table_soup)
-            marker_table_html = marker_table_html.replace("<br>", " ") # Fintabnet uses spaces instead of newlines
             marker_table_html = marker_table_html.replace("\n", " ") # Fintabnet uses spaces instead of newlines
             gemini_table_html = gemini_table.replace("\n", " ") # Fintabnet uses spaces instead of newlines
```
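The hunk above swaps a literal `str.replace("<br>", " ")` on the serialized HTML for tag-level removal inside the soup. One reason this is more robust: the serializer can render the tag as `<br/>`, which the string replace never matches. A small standalone illustration (assumes `beautifulsoup4` is installed; the sample HTML is made up):

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>Net<br/>income</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Tag-level removal catches <br>, <br/>, and <br /> alike
for br_tag in soup.find_all("br"):
    br_tag.replace_with(soup.new_string(""))

print(str(soup))  # <table><tr><td>Netincome</td></tr></table>
```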
marker/processors/table.py CHANGED

```diff
@@ -101,6 +101,7 @@ class TableProcessor(BaseProcessor):
         )
         self.assign_text_to_cells(tables, table_data)
         self.split_combined_rows(tables) # Split up rows that were combined
+        self.combine_dollar_column(tables) # Combine columns that are just dollar signs
 
         # Assign table cells to the table
         table_idx = 0
@@ -168,6 +169,64 @@ class TableProcessor(BaseProcessor):
             text = text.replace(space, ' ')
         return text
 
+
+    def combine_dollar_column(self, tables: List[TableResult]):
+        for table in tables:
+            if len(table.cells) == 0:
+                # Skip empty tables
+                continue
+            unique_cols = sorted(list(set([c.col_id for c in table.cells])))
+            max_col = max(unique_cols)
+            dollar_cols = []
+            for col in unique_cols:
+                # Cells in this col
+                col_cells = [c for c in table.cells if c.col_id == col]
+                col_text = ["\n".join(self.finalize_cell_text(c)).strip() for c in col_cells]
+                all_dollars = all([ct in ["", "$"] for ct in col_text])
+                colspans = [c.colspan for c in col_cells]
+                span_into_col = [c for c in table.cells if c.col_id != col and c.col_id + c.colspan > col > c.col_id]
+
+                # This is a column that is entirely dollar signs
+                if all([
+                    all_dollars,
+                    len(col_cells) > 1,
+                    len(span_into_col) == 0,
+                    all([c == 1 for c in colspans]),
+                    col < max_col
+                ]):
+                    next_col_cells = [c for c in table.cells if c.col_id == col + 1]
+                    next_col_rows = [c.row_id for c in next_col_cells]
+                    col_rows = [c.row_id for c in col_cells]
+                    if len(next_col_cells) == len(col_cells) and next_col_rows == col_rows:
+                        dollar_cols.append(col)
+
+
+            if len(dollar_cols) == 0:
+                continue
+
+            dollar_cols = sorted(dollar_cols)
+            col_offset = 0
+            for col in unique_cols:
+                col_cells = [c for c in table.cells if c.col_id == col]
+                if col_offset == 0 and col not in dollar_cols:
+                    continue
+
+                if col in dollar_cols:
+                    col_offset += 1
+                    for cell in col_cells:
+                        text_lines = cell.text_lines if cell.text_lines else []
+                        next_row_col = [c for c in table.cells if c.row_id == cell.row_id and c.col_id == col + 1]
+
+                        # Add dollar to start of the next column
+                        next_text_lines = next_row_col[0].text_lines if next_row_col[0].text_lines else []
+                        next_row_col[0].text_lines = deepcopy(text_lines) + deepcopy(next_text_lines)
+                        table.cells = [c for c in table.cells if c.cell_id != cell.cell_id] # Remove original cell
+                        next_row_col[0].col_id -= col_offset
+                else:
+                    for cell in col_cells:
+                        cell.col_id -= col_offset
+
+
     def split_combined_rows(self, tables: List[TableResult]):
         for table in tables:
             if len(table.cells) == 0:
```
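To make the intent of `combine_dollar_column` concrete, here is a simplified toy model of the merge using plain `(col_id, row_id, text)` tuples. It handles only the single-dollar-column case and is a sketch of the idea, not the `TableProcessor` implementation:

```python
# Toy table: column 1 holds nothing but "$" signs that the table-recognition
# model split off from the amounts in column 2.
cells = [
    (0, 0, "Revenue"), (1, 0, "$"), (2, 0, "1,000"),
    (0, 1, "Costs"),   (1, 1, "$"), (2, 1, "400"),
]

# A "dollar column" is one whose every cell is "" or "$"
dollar_cols = {
    col for col in {c[0] for c in cells}
    if all(c[2] in ("", "$") for c in cells if c[0] == col)
}

merged = []
for col, row, text in cells:
    if col in dollar_cols:
        continue  # drop the "$" cell; its text moves into the next column
    if col - 1 in dollar_cols:
        text = "$" + text  # prepend the dollar sign to the amount
        col -= 1           # shift left into the freed column
    merged.append((col, row, text))

print(merged)
# [(0, 0, 'Revenue'), (1, 0, '$1,000'), (0, 1, 'Costs'), (1, 1, '$400')]
```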