Commit fea4168 by Vik Paruchuri
Parent(s): d82df96

Add elo ratings

Files changed:
- README.md +13 -5
- benchmarks/overall/elo.py +216 -0
- benchmarks/overall/overall.py +1 -1
- benchmarks/table/inference.py +19 -3
- marker/processors/llm/llm_table.py +5 -3
- marker/processors/llm/llm_table_merge.py +1 -1
- marker/renderers/markdown.py +1 -1
README.md CHANGED

@@ -406,10 +406,11 @@ The projected throughput is 122 pages per second on an H100 - we can run 22 indi
 
 Marker can extract tables from PDFs using `marker.converters.table.TableConverter`. The table extraction performance is measured by comparing the extracted HTML representation of tables against the original HTML representations using the test split of [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). The HTML representations are compared using a tree edit distance based metric to judge both structure and content. Marker detects and identifies the structure of all tables in a PDF page and achieves these scores:
 
-| Avg score | Total tables |
-|-----------|--------------|
-| 0.822     | 54           |
-| 0.887     | 54           |
+| Method           | Avg score | Total tables |
+|------------------|-----------|--------------|
+| marker           | 0.822     | 54           |
+| marker w/use_llm | 0.887     | 54           |
+| gemini           |           | 54           |
 
 The `--use_llm` flag can significantly improve table recognition performance, as you can see.
 
@@ -429,9 +430,16 @@ poetry install
 
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:
 
 ```shell
-python benchmarks/overall.py
+python benchmarks/overall.py --methods marker --scores heuristic,llm
 ```
 
+Options:
+
+- `--use_llm` uses an LLM to improve the marker results.
+- `--max_rows` sets how many rows to process for the benchmark.
+- `--methods` can be `llamaparse`, `mathpix`, `docling`, `marker`. Comma separated.
+- `--scores` selects which scoring functions to use. Possible values: `heuristic`, `llm`. Comma separated.
+
 ### Table Conversion
 The processed FinTabNet dataset is hosted [here](https://huggingface.co/datasets/datalab-to/fintabnet-test) and is automatically downloaded. Run the benchmark with:
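The tree edit distance comparison described in the README can be sketched in a few lines. This is a hedged illustration of the idea, assuming the third-party `zss` (Zhang-Shasha) and `beautifulsoup4` packages; it is not marker's actual benchmark metric, and the normalization by tree size is an assumption:

```python
# Illustrative only: compare two HTML tables with a tree edit distance.
# Assumes `pip install zss beautifulsoup4`; not marker's benchmark code.
from bs4 import BeautifulSoup
from zss import Node, simple_distance


def html_to_tree(html: str) -> Node:
    """Convert an HTML table into a zss tree; leaf nodes carry cell text
    so the distance penalizes both structural and content mismatches."""
    def build(tag) -> Node:
        node = Node(tag.name)
        children = tag.find_all(recursive=False)
        for child in children:
            node.addkid(build(child))
        if not children:  # leaf cell: attach its text as a child node
            node.addkid(Node(tag.get_text(strip=True)))
        return node
    return build(BeautifulSoup(html, "html.parser").find("table"))


def table_score(pred_html: str, gt_html: str) -> float:
    """Return a similarity in [0, 1]; 1.0 means the trees are identical."""
    pred, gt = html_to_tree(pred_html), html_to_tree(gt_html)

    def size(node: Node) -> int:
        return 1 + sum(size(child) for child in node.children)

    distance = simple_distance(pred, gt)
    return max(0.0, 1 - distance / max(size(pred), size(gt)))


print(table_score("<table><tr><td>1</td></tr></table>",
                  "<table><tr><td>1</td></tr></table>"))  # 1.0
```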
benchmarks/overall/elo.py ADDED

@@ -0,0 +1,216 @@
+import json
+import random
+import time
+from dataclasses import dataclass
+from typing import List, Dict, Tuple, Literal
+from PIL import Image
+
+import click
+import datasets
+from google import genai
+from google.genai.errors import APIError
+from pydantic import BaseModel
+from tqdm import tqdm
+
+from marker.settings import settings
+
+rating_prompt = r"""
+You're a document analysis expert who is comparing two different markdown samples to an image to see which one represents the content of the image better. The markdown will be called version A and version B.
+
+Here are some notes on the image and markdown:
+- Some parts of the page may have been recognized as images and linked from the markdown, like ``.
+- Tables will be formatted as Github flavored markdown.
+- Block equations will be in LaTeX.
+- The image and markdown may be in any language.
+- The markdown is based on the text extracted from the document, and sometimes the document may have had bad OCR applied to it, resulting in gibberish text.
+
+The markdown should fully capture the meaning and formatting of the text in the image. You'll evaluate the markdown based on the image provided.
+
+**Instructions**
+Follow this process to evaluate the markdown:
+1. Carefully examine the image.
+2. Carefully examine the first markdown input provided.
+3. Describe how well version A represents the image.
+4. Carefully examine the second markdown input provided.
+5. Describe how well version B represents the image.
+6. Compare version A and version B.
+7. Decide which markdown representation is better, based on the criteria below. Output version_a if version A is better, and version_b if version B is better.
+
+Use these criteria when judging the markdown:
+- Overall - the overall quality of the markdown as compared to the image.
+- Text quality - the quality of the text extraction from the image.
+- Formatting quality - the quality of the formatting applied to the markdown, as compared to the image.
+- Tables - how effectively the tables have been extracted and formatted.
+- Forms - how effectively the forms have been extracted and formatted.
+- Equations - how effectively block equations have been converted to LaTeX.
+- Lists - if the lists have been properly extracted and formatted.
+- Images - if images are identified and placed correctly.
+
+Notes on scoring:
+- Perfect markdown will include all of the important text from the image, and the formatting will be correct (minor mistakes okay). It's okay to omit some text that isn't important to the meaning, like page numbers and chapter headings. If the entire page is an image, it's okay if the markdown is just a link to the image, unless the image would be better represented as text.
+- Bad markdown will have major missing text segments from the markdown or completely unreadable formatting.
+
+Output json, like in the example below.
+
+**Example**
+Version A
+```markdown
+# *Section 1*
+This is some *markdown* extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Version B
+```markdown
+# Section 1
+This is some markdown extracted from a document. Here is a block equation:
+$$\frac{ab \cdot x^5 + x^2 + 2 \cdot x + 123}{t}$$
+```
+Output
+```json
+{
+    "image_description": "In the image, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_a_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation.",
+    "version_b_description": "In the markdown, there is a section header 'Section 1', followed by some text and a block equation. The formatting in version B is slightly different from the image.",
+    "comparison": "Version A is better than version B. The text and formatting in version A matches the image better than version B.",
+    "winner": "version_a"
+}
+```
+**Input**
+Version A
+```markdown
+{{version_a}}
+```
+Version B
+```markdown
+{{version_b}}
+```
+**Output**
+"""
+
+class ComparerSchema(BaseModel):
+    image_description: str
+    version_a_description: str
+    version_b_description: str
+    comparison: str
+    winner: Literal["version_a", "version_b"]
+
+
+class Comparer:
+    def __init__(self):
+        pass
+
+    def __call__(
+        self,
+        img: Image.Image,
+        version_a: str,
+        version_b: str
+    ) -> str | None:
+        hydrated_prompt = rating_prompt.replace("{{version_a}}", version_a).replace("{{version_b}}", version_b)
+        rating = self.llm_rater(img, hydrated_prompt)
+        return rating
+
+
+    def llm_rater(self, img: Image.Image, prompt: str):
+        response = self.llm_response_wrapper(
+            [img, prompt],
+            ComparerSchema
+        )
+        assert "winner" in response, f"Response missing 'winner' key: {response}"
+        return response["winner"]
+
+    def llm_response_wrapper(
+        self,
+        prompt,
+        response_schema,
+    ):
+        client = genai.Client(
+            api_key=settings.GOOGLE_API_KEY,
+            http_options={"timeout": 60000}
+        )
+        try:
+            responses = client.models.generate_content(
+                model="gemini-2.0-flash",
+                contents=prompt,
+                config={
+                    "temperature": 0,
+                    "response_schema": response_schema,
+                    "response_mime_type": "application/json",
+                },
+            )
+            output = responses.candidates[0].content.parts[0].text
+            return json.loads(output)
+        except APIError:
+            print("Hit Gemini rate limit")
+            return
+
+@dataclass
+class Method:
+    name: str
+    rating: float = 1500
+    k_factor: float = 32
+
+
+class EloSystem:
+    def __init__(self, player_names: List[str]):
+        self.methods = {name: Method(name) for name in player_names}
+
+    def expected_score(self, rating_a: float, rating_b: float) -> float:
+        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
+
+    def update_ratings(self, winner: str, loser: str) -> Tuple[float, float]:
+        method_a = self.methods[winner]
+        method_b = self.methods[loser]
+
+        expected_a = self.expected_score(method_a.rating, method_b.rating)
+        expected_b = self.expected_score(method_b.rating, method_a.rating)
+
+        # Winner gets score of 1, loser gets 0
+        method_a.rating += method_a.k_factor * (1 - expected_a)
+        method_b.rating += method_b.k_factor * (0 - expected_b)
+
+        return method_a.rating, method_b.rating
+
+
+@click.command("Calculate ELO scores for document conversion methods")
+@click.argument("dataset", type=str)
+@click.option("--methods", type=str, help="List of methods to compare: comma separated like marker,mathpix")
+@click.option("--row_samples", type=int, default=2, help="Number of samples per row")
+@click.option("--max_rows", type=int, default=100, help="Maximum number of rows to process")
+def main(
+    dataset: str,
+    methods: str,
+    row_samples: int,
+    max_rows: int
+):
+    ds = datasets.load_dataset(dataset, split="train")
+    method_lst = methods.split(",")
+    elo = EloSystem(method_lst)
+    comparer = Comparer()
+
+    for i in tqdm(range(min(len(ds), max_rows)), desc="Calculating ELO"):
+        row = ds[i]
+        for j in range(row_samples):
+            method_a = random.choice(method_lst)
+            method_b = random.choice(method_lst)
+            if method_a == method_b:
+                continue
+
+            method_a_md = row[f"{method_a}_md"]
+            method_b_md = row[f"{method_b}_md"]
+            winner = comparer(row["img"], method_a_md, method_b_md)
+            if not winner:
+                continue
+
+            if winner == "version_a":
+                elo.update_ratings(method_a, method_b)
+            else:
+                elo.update_ratings(method_b, method_a)
+        if i % 10 == 0:
+            print(elo.methods)
+
+    # Print out ratings
+    print(elo.methods)
+
+
+if __name__ == "__main__":
+    main()
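The `EloSystem` above uses the standard Elo expected score, `E_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))`, with a K-factor of 32. A quick standalone sanity check of a single update (this mirrors the class rather than importing it):

```python
# Standalone check of the Elo update: with equal ratings the expected score
# is 0.5, so a win moves the winner up by K * (1 - 0.5) = 16 points.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

k_factor = 32.0
marker, mathpix = 1500.0, 1500.0             # every method starts at 1500

expected = expected_score(marker, mathpix)   # 0.5 for equal ratings
marker += k_factor * (1 - expected)          # winner: 1500 -> 1516
mathpix += k_factor * (0 - expected)         # loser:  1500 -> 1484
print(marker, mathpix)                       # 1516.0 1484.0
```

Given a dataset whose rows carry an `img` column and per-method markdown columns like `marker_md` (which is what the loop expects), the script would be run along the lines of `python benchmarks/overall/elo.py <dataset> --methods marker,mathpix`, where `<dataset>` is a placeholder for the dataset name.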
benchmarks/overall/overall.py CHANGED

@@ -80,7 +80,7 @@ def get_method_scores(benchmark_dataset: datasets.Dataset, methods: List[str], s
 @click.command(help="Benchmark PDF to MD conversion.")
 @click.option("--dataset", type=str, help="Path to the benchmark dataset", default="datalab-to/marker_benchmark")
 @click.option("--out_dataset", type=str, help="Path to the output dataset", default=None)
-@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse", default="marker")
+@click.option("--methods", type=str, help="Comma separated list of other methods to compare against. Possible values: marker,mathpix,llamaparse,docling", default="marker")
 @click.option("--scores", type=str, help="Comma separated list of scoring functions to use. Possible values: heuristic,llm", default="heuristic")
 @click.option("--result_path", type=str, default=os.path.join(settings.OUTPUT_DIR, "benchmark", "overall"), help="Output path for results.")
 @click.option("--max_rows", type=int, default=None, help="Maximum number of rows to process.")
benchmarks/table/inference.py CHANGED

@@ -11,6 +11,8 @@ from benchmarks.table.gemini import gemini_table_rec
 from marker.config.parser import ConfigParser
 from marker.converters.table import TableConverter
 from marker.models import create_model_dict
+from marker.processors.llm.llm_table import LLMTableProcessor
+from marker.processors.table import TableProcessor
 from marker.renderers.json import JSONBlockOutput
 from marker.schema.polygon import PolygonBox
 from marker.util import matrix_intersection_area

@@ -42,10 +44,14 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         pdf_binary = base64.b64decode(row['pdf'])
         gt_tables = row['tables']  # Already sorted by reading order, which is what marker returns
 
+        # Only use the basic table processors
         converter = TableConverter(
             config=config_parser.generate_config_dict(),
            artifact_dict=models,
-            processor_list=
+            processor_list=[
+                "marker.processors.table.TableProcessor",
+                "marker.processors.llm.llm_table.LLMTableProcessor",
+            ],
             renderer=config_parser.get_renderer()
         )

@@ -67,6 +73,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
         marker_table_boxes = [table.bbox for table in marker_tables]
         page_bbox = marker_json[0].bbox
 
+        if len(marker_tables) != len(gt_tables):
+            print(f'Number of tables do not match, skipping...')
+            total_unaligned += len(gt_tables)
+            continue
+
         table_images = [
             page_image.crop(
                 PolygonBox.from_bbox(bbox)

@@ -102,6 +113,11 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
                 unaligned_tables.add(table_idx)
                 continue
 
+            if max_area <= .01:
+                # No alignment found
+                unaligned_tables.add(table_idx)
+                continue
+
            if aligned_idx in used_tables:
                 # Marker table already aligned with another gt table
                 unaligned_tables.add(table_idx)

@@ -109,13 +125,13 @@ def inference_tables(dataset, use_llm: bool, table_rec_batch_size: int | None, m
 
             # Gt table doesn't align well with any marker table
             gt_table_pct = gt_areas[table_idx] / max_area
-            if not .
+            if not .85 < gt_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue
 
             # Marker table doesn't align with gt table
             marker_table_pct = marker_areas[aligned_idx] / max_area
-            if not .
+            if not .85 < marker_table_pct < 1.15:
                 unaligned_tables.add(table_idx)
                 continue
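The alignment checks above amount to a band test: a ground truth table and a marker table only count as aligned when both of their areas fall within 85-115% of the best intersection area. A minimal sketch of that test, assuming `max_area` is the largest intersection between the gt table and any marker table (the real code computes this with `marker.util.matrix_intersection_area` over all pairs; `is_aligned` is a hypothetical helper for illustration):

```python
# Sketch only: the band test used to decide whether a gt/marker table pair
# is aligned. Assumes max_area is their largest intersection area.
def is_aligned(gt_area: float, marker_area: float, max_area: float) -> bool:
    if max_area <= .01:  # no meaningful overlap -> unaligned
        return False
    gt_pct = gt_area / max_area
    marker_pct = marker_area / max_area
    # Both tables must be within 85%-115% of the shared overlap area
    return .85 < gt_pct < 1.15 and .85 < marker_pct < 1.15


print(is_aligned(100.0, 95.0, 98.0))  # True: both ratios are near 1
print(is_aligned(100.0, 40.0, 42.0))  # False: gt table is ~2.4x the overlap
```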
marker/processors/llm/llm_table.py CHANGED

@@ -42,13 +42,13 @@ Some guidelines:
 - If you see any math in a table cell, fence it with the <math display="inline"> tag. Block math should be fenced with <math display="block">.
 - Replace any images with a description, like "Image: [description]".
 - Only use the tags th, td, tr, br, span, i, b, math, and table. Only use the attributes display, style, colspan, and rowspan if necessary. You can use br to break up text lines in cells.
+- If you see a dollar sign ($), or a percent sign (%) associated with a number, combine it with the number it is associated with in a single column versus splitting it into multiple columns.
 
 **Instructions:**
 1. Carefully examine the provided text block image.
 2. Analyze the html representation of the table.
-3.
-4. If the html representation contains errors, generate the corrected html representation.
-5. Output only either the corrected html representation or "No corrections needed."
+3. Write a comparison of the image and the html representation.
+4. If the html representation is largely correct, or you cannot read the image properly, then write "No corrections needed." If the html representation contains errors, generate the corrected html representation. Output only either the corrected html representation or "No corrections needed."
 **Example:**
 Input:
 ```html

@@ -67,6 +67,7 @@ Input:
 ```
 Output:
 ```html
+Comparison: The image shows a table with 2 rows and 3 columns. The text and formatting of the html table matches the image.
 No corrections needed.
 ```
 **Input:**

@@ -237,4 +238,5 @@ No corrections needed.
     return cells
 
 class TableSchema(BaseModel):
+    description: str
     correct_html: str
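The added `description` field makes the model emit its image-vs-html comparison before `correct_html`, matching the rewritten prompt steps above. A minimal sketch of parsing such a structured response with pydantic v2 (the JSON payload is illustrative, not real model output):

```python
from pydantic import BaseModel


class TableSchema(BaseModel):
    description: str   # the comparison, generated before the html
    correct_html: str  # corrected html, or "No corrections needed."


raw = (
    '{"description": "The html table matches the image.", '
    '"correct_html": "No corrections needed."}'
)
parsed = TableSchema.model_validate_json(raw)
print(parsed.correct_html)  # No corrections needed.
```

Putting the free-form field first is a common trick with structured outputs: the model writes its comparison tokens before committing to the final answer field.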
marker/processors/llm/llm_table_merge.py CHANGED

@@ -39,7 +39,7 @@ class LLMTableMergeProcessor(BaseLLMProcessor):
     horizontal_table_distance_threshold: Annotated[
         int,
         "The maximum distance between table edges for adjacency."
-    ] =
+    ] = 10
     column_gap_threshold: Annotated[
         int,
         "The maximum gap between columns to merge tables"
marker/renderers/markdown.py CHANGED

@@ -99,7 +99,7 @@ class Markdownify(MarkdownConverter):
             for r in range(int(cell.get('rowspan', 1)) - 1):
                 rowspan_cols[i + r] += colspan  # Add the colspan to the next rows, so they get the correct number of columns
             colspans.append(row_cols)
-        total_cols = max(colspans)
+        total_cols = max(colspans) if colspans else 0
 
         grid = [[None for _ in range(total_cols)] for _ in range(total_rows)]
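This one-line fix guards an edge case: `max()` raises `ValueError` on an empty sequence, so a table that produced no `colspans` entries would crash the renderer. A minimal reproduction of the old versus fixed behavior:

```python
colspans: list[int] = []  # a table that yielded no parsed rows

try:
    total_cols = max(colspans)  # old form raises on the empty list
except ValueError:
    pass

total_cols = max(colspans) if colspans else 0  # fixed form
print(total_cols)  # 0
```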