Vik Paruchuri
committed on
Commit · 18e797e
Parent(s): 6fa9fe6
Initial table integration
- .github/workflows/tests.yml +0 -4
- CLA.md +2 -2
- README.md +1 -10
- benchmarks/table.py +0 -77
- marker/convert.py +2 -11
- marker/models.py +19 -15
- marker/ocr/recognition.py +15 -4
- marker/pdf/extract_text.py +1 -1
- marker/postprocessors/editor.py +0 -123
- marker/postprocessors/t5.py +0 -141
- marker/settings.py +3 -7
- marker/tables/cells.py +0 -112
- marker/tables/edges.py +0 -122
- marker/tables/table.py +91 -146
- poetry.lock +311 -212
- pyproject.toml +7 -9
.github/workflows/tests.yml
CHANGED
@@ -29,10 +29,6 @@ jobs:
       run: |
         poetry run python benchmarks/overall.py benchmark_data/pdfs benchmark_data/references report.json
         poetry run python scripts/verify_benchmark_scores.py report.json --type marker
-    - name: Run table benchmark
-      run: |
-        poetry run python benchmarks/table.py tables.json
-        poetry run python scripts/verify_benchmark_scores.py tables.json --type table
 
 
 
CLA.md
CHANGED
@@ -1,6 +1,6 @@
 Marker Contributor Agreement
 
-This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean
+This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Endless Labs, Inc. The term "you" shall mean the person or entity identified below.
 
 If you agree to be bound by these terms, sign by writing "I have read the CLA document and I hereby sign the CLA" in response to the CLA bot Github comment. Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.
 
@@ -20,5 +20,5 @@ If you or your affiliates institute patent litigation against any entity (includ
 - each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this MCA;
 - to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
 - each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws.
-You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect.
+You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Endless Labs, Inc. may publicly disclose your participation in the project, including the fact that you have signed the MCA.
 6. This MCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.
README.md
CHANGED
@@ -42,7 +42,7 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
 
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
+The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
 
 # Hosted API
 
@@ -217,14 +217,6 @@ This will benchmark marker against other text extraction methods. It sets up ba
 
 Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.
 
-### Table benchmark
-
-There is a benchmark for table parsing, which you can run with:
-
-```shell
-python benchmarks/table.py test_data/tables.json
-```
-
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -233,6 +225,5 @@ This work would not have been possible without amazing open source models and da
 - Texify
 - Pypdfium2/pdfium
 - DocLayNet from IBM
-- ByT5 from Google
 
 Thank you to the authors of these models and datasets for making them available to the community!
benchmarks/table.py
DELETED
@@ -1,77 +0,0 @@
-import argparse
-import json
-
-import datasets
-from surya.schema import LayoutResult, LayoutBox
-from tqdm import tqdm
-
-from marker.benchmark.table import score_table
-from marker.schema.bbox import rescale_bbox
-from marker.schema.page import Page
-from marker.tables.table import format_tables
-
-
-
-def main():
-    parser = argparse.ArgumentParser(description="Benchmark table conversion.")
-    parser.add_argument("out_file", help="Output filename for results")
-    parser.add_argument("--dataset", type=str, help="Dataset to use", default="vikp/table_bench")
-    args = parser.parse_args()
-
-    ds = datasets.load_dataset(args.dataset, split="train")
-
-    results = []
-    for i in tqdm(range(len(ds)), desc="Evaluating tables"):
-        row = ds[i]
-        marker_page = Page(**json.loads(row["marker_page"]))
-        table_bbox = row["table_bbox"]
-        gpt4_table = json.loads(row["gpt_4_table"])["markdown_table"]
-
-        # Counterclockwise polygon from top left
-        table_poly = [
-            [table_bbox[0], table_bbox[1]],
-            [table_bbox[2], table_bbox[1]],
-            [table_bbox[2], table_bbox[3]],
-            [table_bbox[0], table_bbox[3]],
-        ]
-
-        # Remove all other tables from the layout results
-        layout_result = LayoutResult(
-            bboxes=[
-                LayoutBox(
-                    label="Table",
-                    polygon=table_poly
-                )
-            ],
-            segmentation_map="",
-            image_bbox=marker_page.text_lines.image_bbox
-        )
-
-        marker_page.layout = layout_result
-        format_tables([marker_page])
-
-        table_blocks = [block for block in marker_page.blocks if block.block_type == "Table"]
-        if len(table_blocks) != 1:
-            continue
-
-        table_block = table_blocks[0]
-        table_md = table_block.lines[0].spans[0].text
-
-        results.append({
-            "score": score_table(table_md, gpt4_table),
-            "arxiv_id": row["arxiv_id"],
-            "page_idx": row["page_idx"],
-            "marker_table": table_md,
-            "gpt4_table": gpt4_table,
-            "table_bbox": table_bbox
-        })
-
-    avg_score = sum([r["score"] for r in results]) / len(results)
-    print(f"Evaluated {len(results)} tables, average score is {avg_score}.")
-
-    with open(args.out_file, "w+") as f:
-        json.dump(results, f, indent=2)
-
-
-if __name__ == "__main__":
-    main()
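The deleted benchmark built a one-table layout by converting an axis-aligned `[x0, y0, x1, y1]` bbox into a four-point polygon. As a standalone sketch of that conversion (the helper name is mine, not from the repo):

```python
def bbox_to_polygon(bbox):
    # Corners in the order the deleted script used:
    # top-left, top-right, bottom-right, bottom-left ([x, y] pairs).
    x0, y0, x1, y1 = bbox
    return [[x0, y0], [x1, y0], [x1, y1], [x0, y1]]

print(bbox_to_polygon([10, 20, 110, 220]))
```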
marker/convert.py
CHANGED
@@ -20,7 +20,6 @@ from marker.pdf.extract_text import get_text_blocks
 from marker.cleaners.headers import filter_header_footer, filter_common_titles
 from marker.equations.equations import replace_equations
 from marker.pdf.utils import find_filetype
-from marker.postprocessors.editor import edit_full_text
 from marker.cleaners.code import identify_code_blocks, indent_blocks
 from marker.cleaners.bullets import replace_bullets
 from marker.cleaners.headings import split_heading_blocks
@@ -83,7 +82,7 @@ def convert_single_pdf(
     doc.del_page(0)
 
     # Unpack models from list
-    texify_model, layout_model, order_model, edit_model, detection_model, ocr_model = model_lst
+    texify_model, layout_model, order_model, detection_model, ocr_model, table_rec_model = model_lst
 
     # Identify text lines on pages
     surya_detection(doc, pages, detection_model, batch_multiplier=batch_multiplier)
@@ -123,7 +122,7 @@ def convert_single_pdf(
     indent_blocks(pages)
 
     # Fix table blocks
-    table_count = format_tables(pages)
+    table_count = format_tables(pages, doc, fname, detection_model, table_rec_model, ocr_model)
     out_meta["block_stats"]["table"] = table_count
 
     for page in pages:
@@ -160,14 +159,6 @@ def convert_single_pdf(
     # Replace bullet characters with a -
     full_text = replace_bullets(full_text)
 
-    # Postprocess text with editor model
-    full_text, edit_stats = edit_full_text(
-        full_text,
-        edit_model,
-        batch_multiplier=batch_multiplier
-    )
-    flush_cuda_memory()
-    out_meta["postprocess_stats"] = {"edit": edit_stats}
     doc_images = images_to_dict(pages)
 
     return full_text, doc_images, out_meta
marker/models.py
CHANGED
@@ -2,7 +2,6 @@ import os
 os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # For some reason, transformers decided to use .isin for a simple op, which is not supported on MPS
 
 
-from marker.postprocessors.editor import load_editing_model
 from surya.model.detection.model import load_model as load_detection_model, load_processor as load_detection_processor
 from texify.model.model import load_model as load_texify_model
 from texify.model.processor import load_processor as load_texify_processor
@@ -11,6 +10,17 @@ from surya.model.recognition.model import load_model as load_recognition_model
 from surya.model.recognition.processor import load_processor as load_recognition_processor
 from surya.model.ordering.model import load_model as load_order_model
 from surya.model.ordering.processor import load_processor as load_order_processor
+from surya.model.table_rec.model import load_model as load_table_model
+from surya.model.table_rec.processor import load_processor as load_table_processor
+
+
+def setup_table_rec_model(device=None, dtype=None):
+    if device:
+        table_model = load_table_model(device=device, dtype=dtype)
+    else:
+        table_model = load_table_model()
+    table_model.processor = load_table_processor()
+    return table_model
 
 
 def setup_recognition_model(device=None, dtype=None):
@@ -18,8 +28,7 @@ def setup_recognition_model(device=None, dtype=None):
         rec_model = load_recognition_model(device=device, dtype=dtype)
     else:
         rec_model = load_recognition_model()
-
-    rec_model.processor = rec_processor
+    rec_model.processor = load_recognition_processor()
     return rec_model
 
 
@@ -28,9 +37,7 @@ def setup_detection_model(device=None, dtype=None):
         model = load_detection_model(device=device, dtype=dtype)
     else:
         model = load_detection_model()
-
-    processor = load_detection_processor()
-    model.processor = processor
+    model.processor = load_detection_processor()
     return model
 
 
@@ -39,8 +46,7 @@ def setup_texify_model(device=None, dtype=None):
         texify_model = load_texify_model(checkpoint=settings.TEXIFY_MODEL_NAME, device=device, dtype=dtype)
     else:
         texify_model = load_texify_model(checkpoint=settings.TEXIFY_MODEL_NAME, device=settings.TORCH_DEVICE_MODEL, dtype=settings.TEXIFY_DTYPE)
-
-    texify_model.processor = texify_processor
+    texify_model.processor = load_texify_processor()
    return texify_model
 
 
@@ -49,8 +55,7 @@ def setup_layout_model(device=None, dtype=None):
         model = load_detection_model(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT, device=device, dtype=dtype)
     else:
         model = load_detection_model(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
-    processor = load_detection_processor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
-    model.processor = processor
+    model.processor = load_detection_processor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
     return model
 
 
@@ -59,12 +64,11 @@ def setup_order_model(device=None, dtype=None):
         model = load_order_model(device=device, dtype=dtype)
     else:
         model = load_order_model()
-    processor = load_order_processor()
-    model.processor = processor
+    model.processor = load_order_processor()
     return model
 
 
-def load_all_models(device=None, dtype=None, force_load_ocr=False):
+def load_all_models(device=None, dtype=None):
     if device is not None:
         assert dtype is not None, "Must provide dtype if device is provided"
 
@@ -72,10 +76,10 @@ def load_all_models(device=None, dtype=None, force_load_ocr=False):
     detection = setup_detection_model(device, dtype)
     layout = setup_layout_model(device, dtype)
     order = setup_order_model(device, dtype)
-    edit = load_editing_model(device, dtype)
 
     # Only load recognition model if we'll need it for all pdfs
     ocr = setup_recognition_model(device, dtype)
     texify = setup_texify_model(device, dtype)
-    model_lst = [texify, layout, order, edit, detection, ocr]
+    table_model = setup_table_rec_model(device, dtype)
+    model_lst = [texify, layout, order, detection, ocr, table_model]
     return model_lst
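The new `load_all_models` returns a plain list, so the call site in `convert.py` must unpack it in exactly the order it was built. A minimal sketch of that contract with stand-in string loaders (the loaders here are placeholders, not the surya API):

```python
def load_all_models():
    # Build order must match the unpack order at the call site.
    texify, layout, order = "texify", "layout", "order"
    detection, ocr, table_rec = "detection", "ocr", "table_rec"
    return [texify, layout, order, detection, ocr, table_rec]

# Mirrors the unpack in convert_single_pdf
texify_model, layout_model, order_model, detection_model, ocr_model, table_rec_model = load_all_models()
print(table_rec_model)
```

Keeping the two orderings in sync is the whole contract; a mismatch silently hands the wrong model to the wrong stage rather than raising an error.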
marker/ocr/recognition.py
CHANGED
@@ -65,7 +65,10 @@ def run_ocr(doc, pages: List[Page], langs: List[str], rec_model, batch_multiplie
 
 
 def surya_recognition(doc, page_idxs, langs: List[str], rec_model, pages: List[Page], batch_multiplier=1) -> List[Optional[Page]]:
+    # Slice images in higher resolution than detection happened in
     images = [render_image(doc[pnum], dpi=settings.SURYA_OCR_DPI) for pnum in page_idxs]
+    box_scale = settings.SURYA_OCR_DPI / settings.SURYA_DETECTOR_DPI
+
     processor = rec_model.processor
     selected_pages = [p for i, p in enumerate(pages) if i in page_idxs]
 
@@ -73,6 +76,12 @@ def surya_recognition(doc, page_idxs, langs: List[str], rec_model, pages: List[P
     detection_results = [p.text_lines.bboxes for p in selected_pages]
     polygons = [[b.polygon for b in bboxes] for bboxes in detection_results]
 
+    # Scale polygons to get correct image slices
+    for poly in polygons:
+        for p in poly:
+            for i in range(len(p)):
+                p[i] = [int(p[i][0] * box_scale), int(p[i][1] * box_scale)]
+
     results = run_recognition(images, surya_langs, rec_model, processor, polygons=polygons, batch_size=int(get_batch_size() * batch_multiplier))
 
     new_pages = []
@@ -81,14 +90,15 @@ def surya_recognition(doc, page_idxs, langs: List[str], rec_model, pages: List[P
         ocr_results = result.text_lines
         blocks = []
         for i, line in enumerate(ocr_results):
+            scaled_bbox = [b / box_scale for b in line.bbox]
             block = Block(
-                bbox=line.bbox,
+                bbox=scaled_bbox,
                 pnum=page_idx,
                 lines=[Line(
-                    bbox=line.bbox,
+                    bbox=scaled_bbox,
                     spans=[Span(
                         text=line.text,
-                        bbox=line.bbox,
+                        bbox=scaled_bbox,
                         span_id=f"{page_idx}_{i}",
                         font="",
                         font_weight=0,
@@ -98,10 +108,11 @@ def surya_recognition(doc, page_idxs, langs: List[str], rec_model, pages: List[P
                 )]
             )
             blocks.append(block)
+        scaled_image_bbox = [b / box_scale for b in result.image_bbox]
         page = Page(
             blocks=blocks,
             pnum=page_idx,
-            bbox=result.image_bbox,
+            bbox=scaled_image_bbox,
             rotation=0,
             text_lines=text_lines,
             ocr_method="surya"
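The recognition hunk above scales detection polygons up by `SURYA_OCR_DPI / SURYA_DETECTOR_DPI` before slicing the high-DPI image, then divides the OCR results by the same factor to return them to detection-space coordinates. The round trip, sketched with illustrative DPI values (the 192/96 numbers are assumptions standing in for the settings):

```python
OCR_DPI = 192      # stands in for settings.SURYA_OCR_DPI (illustrative)
DETECTOR_DPI = 96  # stands in for settings.SURYA_DETECTOR_DPI (illustrative)
box_scale = OCR_DPI / DETECTOR_DPI

# Detection-space polygon, scaled up so it slices the high-DPI render correctly
poly = [[10, 20], [50, 20], [50, 40], [10, 40]]
scaled = [[int(x * box_scale), int(y * box_scale)] for x, y in poly]

# OCR returns bboxes in the high-DPI image space; divide to go back
ocr_bbox = [20, 40, 100, 80]
detection_space = [b / box_scale for b in ocr_bbox]
print(scaled, detection_space)
```

Note the upscale truncates with `int()`, so the round trip can be off by a fraction of a detection-space pixel; for line bboxes that error is negligible.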
marker/pdf/extract_text.py
CHANGED
@@ -90,7 +90,7 @@ def get_text_blocks(doc, fname, max_pages: Optional[int] = None, start_page: Opt
 
     page_range = range(start_page, start_page + max_pages)
 
-    char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=True, workers=settings.PDFTEXT_CPU_WORKERS)
+    char_blocks = dictionary_output(fname, page_range=page_range, keep_chars=False, workers=settings.PDFTEXT_CPU_WORKERS)
     marker_blocks = [pdftext_format_to_blocks(page, pnum) for pnum, page in enumerate(char_blocks)]
 
     return marker_blocks, toc
marker/postprocessors/editor.py
DELETED
@@ -1,123 +0,0 @@
-from collections import defaultdict
-from itertools import chain
-from typing import Optional
-
-from marker.settings import settings
-import torch
-import torch.nn.functional as F
-from marker.postprocessors.t5 import T5ForTokenClassification, byt5_tokenize
-
-
-def get_batch_size():
-    if settings.EDITOR_BATCH_SIZE is not None:
-        return settings.EDITOR_BATCH_SIZE
-    elif settings.TORCH_DEVICE_MODEL == "cuda":
-        return 12
-    return 6
-
-
-def load_editing_model(device=None, dtype=None):
-    if not settings.ENABLE_EDITOR_MODEL:
-        return None
-
-    if device:
-        model = T5ForTokenClassification.from_pretrained(
-            settings.EDITOR_MODEL_NAME,
-            torch_dtype=dtype,
-            device=device,
-        )
-    else:
-        model = T5ForTokenClassification.from_pretrained(
-            settings.EDITOR_MODEL_NAME,
-            torch_dtype=settings.MODEL_DTYPE,
-        ).to(settings.TORCH_DEVICE_MODEL)
-    model.eval()
-
-    model.config.label2id = {
-        "equal": 0,
-        "delete": 1,
-        "newline-1": 2,
-        "space-1": 3,
-    }
-    model.config.id2label = {v: k for k, v in model.config.label2id.items()}
-    return model
-
-
-def edit_full_text(text: str, model: Optional[T5ForTokenClassification], batch_multiplier=1) -> (str, dict):
-    if not model:
-        return text, {}
-
-    batch_size = get_batch_size() * batch_multiplier
-    tokenized = byt5_tokenize(text, settings.EDITOR_MAX_LENGTH)
-    input_ids = tokenized["input_ids"]
-    char_token_lengths = tokenized["char_token_lengths"]
-
-    # Run model
-    token_masks = []
-    for i in range(0, len(input_ids), batch_size):
-        batch_input_ids = tokenized["input_ids"][i: i + batch_size]
-        batch_input_ids = torch.tensor(batch_input_ids, device=model.device)
-        batch_attention_mask = tokenized["attention_mask"][i: i + batch_size]
-        batch_attention_mask = torch.tensor(batch_attention_mask, device=model.device)
-        with torch.inference_mode():
-            predictions = model(batch_input_ids, attention_mask=batch_attention_mask)
-
-        logits = predictions.logits.cpu()
-
-        # If the max probability is less than a threshold, we assume it's a bad prediction
-        # We want to be conservative to not edit the text too much
-        probs = F.softmax(logits, dim=-1)
-        max_prob = torch.max(probs, dim=-1)
-        cutoff_prob = max_prob.values < settings.EDITOR_CUTOFF_THRESH
-        labels = logits.argmax(-1)
-        labels[cutoff_prob] = model.config.label2id["equal"]
-        labels = labels.squeeze().tolist()
-        if len(labels) == settings.EDITOR_MAX_LENGTH:
-            labels = [labels]
-        labels = list(chain.from_iterable(labels))
-        token_masks.extend(labels)
-
-    # List of characters in the text
-    flat_input_ids = list(chain.from_iterable(input_ids))
-
-    # Strip special tokens 0,1. Keep unknown token, although it should never be used
-    assert len(token_masks) == len(flat_input_ids)
-    token_masks = [mask for mask, token in zip(token_masks, flat_input_ids) if token >= 2]
-
-    assert len(token_masks) == len(list(text.encode("utf-8")))
-
-    edit_stats = defaultdict(int)
-    out_text = []
-    start = 0
-    for i, char in enumerate(text):
-        char_token_length = char_token_lengths[i]
-        masks = token_masks[start: start + char_token_length]
-        labels = [model.config.id2label[mask] for mask in masks]
-        if all(l == "delete" for l in labels):
-            # If we delete whitespace, roll with it, otherwise ignore
-            if char.strip():
-                out_text.append(char)
-            else:
-                edit_stats["delete"] += 1
-        elif labels[0] == "newline-1":
-            out_text.append("\n")
-            out_text.append(char)
-            edit_stats["newline-1"] += 1
-        elif labels[0] == "space-1":
-            out_text.append(" ")
-            out_text.append(char)
-            edit_stats["space-1"] += 1
-        else:
-            out_text.append(char)
-            edit_stats["equal"] += 1
-
-        start += char_token_length
-
-    out_text = "".join(out_text)
-    return out_text, edit_stats
marker/postprocessors/t5.py
DELETED
@@ -1,141 +0,0 @@
-from transformers import T5Config, T5PreTrainedModel
-import torch
-from torch import nn
-from copy import deepcopy
-from typing import Optional, Tuple, Union
-from itertools import chain
-
-from transformers.modeling_outputs import TokenClassifierOutput
-from transformers.models.t5.modeling_t5 import T5Stack
-from transformers.utils.model_parallel_utils import get_device_map, assert_device_map
-
-
-def byt5_tokenize(text: str, max_length: int, pad_token_id: int = 0):
-    byte_codes = []
-    for char in text:
-        # Add 3 to account for special tokens
-        byte_codes.append([byte + 3 for byte in char.encode('utf-8')])
-
-    tokens = list(chain.from_iterable(byte_codes))
-    # Map each token to the character it represents
-    char_token_lengths = [len(b) for b in byte_codes]
-
-    batched_tokens = []
-    attention_mask = []
-    for i in range(0, len(tokens), max_length):
-        batched_tokens.append(tokens[i:i + max_length])
-        attention_mask.append([1] * len(batched_tokens[-1]))
-
-    # Pad last item
-    if len(batched_tokens[-1]) < max_length:
-        batched_tokens[-1] += [pad_token_id] * (max_length - len(batched_tokens[-1]))
-        attention_mask[-1] += [0] * (max_length - len(attention_mask[-1]))
-
-    return {"input_ids": batched_tokens, "attention_mask": attention_mask, "char_token_lengths": char_token_lengths}
-
-
-# From https://github.com/osainz59/t5-encoder
-class T5ForTokenClassification(T5PreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"encoder.embed_tokens.weight"]
-
-    def __init__(self, config: T5Config):
-        super().__init__(config)
-        self.model_dim = config.d_model
-
-        self.shared = nn.Embedding(config.vocab_size, config.d_model)
-
-        encoder_config = deepcopy(config)
-        encoder_config.is_decoder = False
-        encoder_config.is_encoder_decoder = False
-        encoder_config.use_cache = False
-        self.encoder = T5Stack(encoder_config, self.shared)
-
-        classifier_dropout = (
-            config.classifier_dropout if hasattr(config, 'classifier_dropout') else config.dropout_rate
-        )
-        self.dropout = nn.Dropout(classifier_dropout)
-        self.classifier = nn.Linear(config.d_model, config.num_labels)
-
-        # Initialize weights and apply final processing
-        self.post_init()
-
-        # Model parallel
-        self.model_parallel = False
-        self.device_map = None
-
-    def parallelize(self, device_map=None):
-        self.device_map = (
-            get_device_map(len(self.encoder.block), range(torch.cuda.device_count()))
-            if device_map is None
-            else device_map
-        )
-        assert_device_map(self.device_map, len(self.encoder.block))
-        self.encoder.parallelize(self.device_map)
-        self.classifier.to(self.encoder.first_device)
-        self.model_parallel = True
-
-    def deparallelize(self):
-        self.encoder.deparallelize()
-        self.encoder = self.encoder.to("cpu")
-        self.classifier = self.classifier.to("cpu")
-        self.model_parallel = False
-        self.device_map = None
-        torch.cuda.empty_cache()
-
-    def get_input_embeddings(self):
-        return self.shared
-
-    def set_input_embeddings(self, new_embeddings):
-        self.shared = new_embeddings
-        self.encoder.set_input_embeddings(new_embeddings)
-
-    def get_encoder(self):
-        return self.encoder
-
-    def _prune_heads(self, heads_to_prune):
-        for layer, heads in heads_to_prune.items():
-            self.encoder.block[layer].layer[0].SelfAttention.prune_heads(heads)
-
-    def forward(
-        self,
-        input_ids: Optional[torch.LongTensor] = None,
-        attention_mask: Optional[torch.FloatTensor] = None,
-        head_mask: Optional[torch.FloatTensor] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        labels: Optional[torch.LongTensor] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
-    ) -> Union[Tuple[torch.FloatTensor], TokenClassifierOutput]:
-        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
-        outputs = self.encoder(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            inputs_embeds=inputs_embeds,
-            head_mask=head_mask,
-            output_attentions=output_attentions,
-            output_hidden_states=output_hidden_states,
-            return_dict=return_dict,
-        )
-
-        sequence_output = outputs[0]
-
-        sequence_output = self.dropout(sequence_output)
-        logits = self.classifier(sequence_output)
-
-        loss = None
-
-        if not return_dict:
-            output = (logits,) + outputs[2:]
-            return ((loss,) + output) if loss is not None else output
-
-        return TokenClassifierOutput(
-            loss=loss,
-            logits=logits,
-            hidden_states=outputs.hidden_states,
-            attentions=outputs.attentions
-        )
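The deleted `byt5_tokenize` helper above is self-contained: it maps each character to its UTF-8 bytes (offset by 3 to leave room for ByT5's special tokens), flattens those into one token stream, and chunks the stream into fixed-length batches with an attention mask. A minimal standalone sketch of the same scheme, copied from the deleted code with a guard for empty input added:

```python
from itertools import chain

def byt5_tokenize(text: str, max_length: int, pad_token_id: int = 0):
    # Each character becomes one or more UTF-8 bytes; +3 leaves room for special tokens
    byte_codes = [[byte + 3 for byte in char.encode("utf-8")] for char in text]
    tokens = list(chain.from_iterable(byte_codes))
    # How many tokens each character produced (needed to map predictions back to chars)
    char_token_lengths = [len(b) for b in byte_codes]

    batched_tokens, attention_mask = [], []
    for i in range(0, len(tokens), max_length):
        batched_tokens.append(tokens[i:i + max_length])
        attention_mask.append([1] * len(batched_tokens[-1]))

    # Pad the last batch out to max_length (empty-input guard is mine, not in the original)
    if batched_tokens and len(batched_tokens[-1]) < max_length:
        pad = max_length - len(batched_tokens[-1])
        batched_tokens[-1] += [pad_token_id] * pad
        attention_mask[-1] += [0] * pad

    return {"input_ids": batched_tokens, "attention_mask": attention_mask,
            "char_token_lengths": char_token_lengths}
```

`"ab"` tokenizes to `[100, 101]` (97+3, 98+3) and is padded to the batch length, while a multi-byte character like `"é"` contributes two tokens for a single character, which is why `char_token_lengths` is tracked.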
marker/settings.py
CHANGED
@@ -47,7 +47,7 @@ class Settings(BaseSettings):
     OCR_ALL_PAGES: bool = False # Run OCR on every page even if text can be extracted
 
     ## Surya
-    SURYA_OCR_DPI: int = 
+    SURYA_OCR_DPI: int = 192
     RECOGNITION_BATCH_SIZE: Optional[int] = None # Batch size for surya OCR defaults to 64 for cuda, 32 otherwise
 
     ## Tesseract
@@ -75,12 +75,8 @@ class Settings(BaseSettings):
     ORDER_BATCH_SIZE: Optional[int] = None # Defaults to 12 for cuda, 6 otherwise
     ORDER_MAX_BBOXES: int = 255
 
-    #
-
-    EDITOR_MAX_LENGTH: int = 1024
-    EDITOR_MODEL_NAME: str = "vikp/pdf_postprocessor_t5"
-    ENABLE_EDITOR_MODEL: bool = False # The editor model can create false positives
-    EDITOR_CUTOFF_THRESH: float = 0.9 # Ignore predictions below this probability
+    # Table models
+    SURYA_TABLE_DPI: int = 192
 
     # Debug
     DEBUG: bool = False # Enable debug logging
marker/tables/cells.py
DELETED
@@ -1,112 +0,0 @@
-from marker.schema.bbox import rescale_bbox, box_intersection_pct
-from marker.schema.page import Page
-import numpy as np
-from sklearn.cluster import DBSCAN
-from marker.settings import settings
-
-
-def cluster_coords(coords, row_count):
-    if len(coords) == 0:
-        return []
-    coords = np.array(sorted(set(coords))).reshape(-1, 1)
-
-    clustering = DBSCAN(eps=.01, min_samples=max(2, row_count // 4)).fit(coords)
-    clusters = clustering.labels_
-
-    separators = []
-    for label in set(clusters):
-        clustered_points = coords[clusters == label]
-        separators.append(np.mean(clustered_points))
-
-    separators = sorted(separators)
-    return separators
-
-
-def find_column_separators(page: Page, table_box, rows, round_factor=.002, min_count=1):
-    left_edges = []
-    right_edges = []
-    centers = []
-
-    line_boxes = [p.bbox for p in page.text_lines.bboxes]
-    line_boxes = [rescale_bbox(page.text_lines.image_bbox, page.bbox, l) for l in line_boxes]
-    line_boxes = [l for l in line_boxes if box_intersection_pct(l, table_box) > settings.BBOX_INTERSECTION_THRESH]
-
-    pwidth = page.bbox[2] - page.bbox[0]
-    pheight = page.bbox[3] - page.bbox[1]
-    for cell in line_boxes:
-        ncell = [cell[0] / pwidth, cell[1] / pheight, cell[2] / pwidth, cell[3] / pheight]
-        left_edges.append(ncell[0] / round_factor * round_factor)
-        right_edges.append(ncell[2] / round_factor * round_factor)
-        centers.append((ncell[0] + ncell[2]) / 2 * round_factor / round_factor)
-
-    left_edges = [l for l in left_edges if left_edges.count(l) > min_count]
-    right_edges = [r for r in right_edges if right_edges.count(r) > min_count]
-    centers = [c for c in centers if centers.count(c) > min_count]
-
-    sorted_left = cluster_coords(left_edges, len(rows))
-    sorted_right = cluster_coords(right_edges, len(rows))
-    sorted_center = cluster_coords(centers, len(rows))
-
-    # Find list with minimum length
-    separators = max([sorted_left, sorted_right, sorted_center], key=len)
-    separators.append(1)
-    separators.insert(0, 0)
-    return separators
-
-
-def assign_cells_to_columns(page, table_box, rows, round_factor=.002, tolerance=.01):
-    separators = find_column_separators(page, table_box, rows, round_factor=round_factor)
-    additional_column_index = 0
-    pwidth = page.bbox[2] - page.bbox[0]
-    row_dicts = []
-
-    for row in rows:
-        new_row = {}
-        last_col_index = -1
-        for cell in row:
-            left_edge = cell[0][0] / pwidth
-            column_index = -1
-            for i, separator in enumerate(separators):
-                if left_edge - tolerance < separator and last_col_index < i:
-                    column_index = i
-                    break
-            if column_index == -1:
-                column_index = len(separators) + additional_column_index
-                additional_column_index += 1
-            new_row[column_index] = cell[1]
-            last_col_index = column_index
-        additional_column_index = 0
-        row_dicts.append(new_row)
-
-    max_row_idx = 0
-    for row in row_dicts:
-        max_row_idx = max(max_row_idx, max(row.keys()))
-
-    # Assign sorted cells to columns, account for blanks
-    new_rows = []
-    for row in row_dicts:
-        flat_row = []
-        for row_idx in range(1, max_row_idx + 1):
-            if row_idx in row:
-                flat_row.append(row[row_idx])
-            else:
-                flat_row.append("")
-        new_rows.append(flat_row)
-
-    # Pad rows to have the same length
-    max_row_len = max([len(r) for r in new_rows])
-    for row in new_rows:
-        while len(row) < max_row_len:
-            row.append("")
-
-    cols_to_remove = set()
-    for idx, col in enumerate(zip(*new_rows)):
-        col_total = sum([len(cell.strip()) > 0 for cell in col])
-        if col_total == 0:
-            cols_to_remove.add(idx)
-
-    rows = []
-    for row in new_rows:
-        rows.append([col for idx, col in enumerate(row) if idx not in cols_to_remove])
-
-    return rows
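The deleted `cluster_coords` above leaned on DBSCAN to collapse near-identical normalized x-coordinates into a handful of column separators. The same effect can be sketched without sklearn by sorting the coordinates and cutting a new cluster wherever the gap exceeds `eps`; the `gap_cluster_coords` helper below is an illustration of that idea, not marker's API:

```python
def gap_cluster_coords(coords, eps=0.01):
    """Group 1-D coordinates into clusters separated by gaps > eps,
    returning the mean of each cluster (one candidate column separator each)."""
    coords = sorted(set(coords))
    if not coords:
        return []
    clusters = [[coords[0]]]
    for c in coords[1:]:
        if c - clusters[-1][-1] <= eps:
            clusters[-1].append(c)   # close enough: same column edge
        else:
            clusters.append([c])     # gap: start a new cluster
    return [sum(cl) / len(cl) for cl in clusters]
```

For example, left edges at 0.10 and 0.105 collapse to one separator near 0.1025, while an edge at 0.5 starts its own cluster.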
marker/tables/edges.py
DELETED
@@ -1,122 +0,0 @@
-import math
-
-import cv2
-import numpy as np
-
-
-def get_detected_lines_sobel(image):
-    sobelx = cv2.Sobel(image, cv2.CV_32F, 1, 0, ksize=3)
-
-    scaled_sobel = np.uint8(255 * sobelx / np.max(sobelx))
-
-    kernel = np.ones((4, 1), np.uint8)
-    eroded = cv2.erode(scaled_sobel, kernel, iterations=1)
-    scaled_sobel = cv2.dilate(eroded, kernel, iterations=3)
-
-    return scaled_sobel
-
-
-def get_line_angle(x1, y1, x2, y2):
-    slope = (y2 - y1) / (x2 - x1)
-
-    angle_radians = math.atan(slope)
-    angle_degrees = math.degrees(angle_radians)
-
-    return angle_degrees
-
-
-def get_detected_lines(image, slope_tol_deg=10):
-    new_image = image.astype(np.float32) * 255 # Convert to 0-255 range
-    new_image = get_detected_lines_sobel(new_image)
-    new_image = new_image.astype(np.uint8)
-
-    edges = cv2.Canny(new_image, 50, 200, apertureSize=3)
-
-    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50, minLineLength=2, maxLineGap=100)
-
-    line_info = []
-    if lines is not None:
-        for line in lines:
-            x1, y1, x2, y2 = line[0]
-            bbox = [x1, y1, x2, y2]
-
-            vertical = False
-            if x2 == x1:
-                vertical = True
-            else:
-                line_angle = get_line_angle(x1, y1, x2, y2)
-                if 90 - slope_tol_deg < line_angle < 90 + slope_tol_deg:
-                    vertical = True
-                elif -90 - slope_tol_deg < line_angle < -90 + slope_tol_deg:
-                    vertical = True
-            if not vertical:
-                continue
-
-            if bbox[3] < bbox[1]:
-                bbox[1], bbox[3] = bbox[3], bbox[1]
-            if bbox[2] < bbox[0]:
-                bbox[0], bbox[2] = bbox[2], bbox[0]
-            if vertical:
-                line_info.append(bbox)
-    return line_info
-
-
-def get_vertical_lines(image, divisor=2, x_tolerance=10, y_tolerance=1):
-    vertical_lines = get_detected_lines(image)
-
-    vertical_lines = sorted(vertical_lines, key=lambda x: x[0])
-    for line in vertical_lines:
-        for i in range(0, len(line)):
-            line[i] = (line[i] // divisor) * divisor
-
-    # Merge adjacent line segments together
-    to_remove = []
-    for i, line in enumerate(vertical_lines):
-        for j, line2 in enumerate(vertical_lines):
-            if j <= i:
-                continue
-            if line[0] != line2[0]:
-                continue
-
-            expanded_line1 = [line[0], line[1] - y_tolerance, line[2],
-                              line[3] + y_tolerance]
-
-            line1_points = set(range(int(expanded_line1[1]), int(expanded_line1[3])))
-            line2_points = set(range(int(line2[1]), int(line2[3])))
-            intersect_y = len(line1_points.intersection(line2_points)) > 0
-
-            if intersect_y:
-                vertical_lines[j][1] = min(line[1], line2[1])
-                vertical_lines[j][3] = max(line[3], line2[3])
-                to_remove.append(i)
-
-    vertical_lines = [line for i, line in enumerate(vertical_lines) if i not in to_remove]
-
-    # Remove redundant segments
-    to_remove = []
-    for i, line in enumerate(vertical_lines):
-        if i in to_remove:
-            continue
-        for j, line2 in enumerate(vertical_lines):
-            if j <= i or j in to_remove:
-                continue
-            close_in_x = abs(line[0] - line2[0]) < x_tolerance
-            line1_points = set(range(int(line[1]), int(line[3])))
-            line2_points = set(range(int(line2[1]), int(line2[3])))
-
-            intersect_y = len(line1_points.intersection(line2_points)) > 0
-
-            if close_in_x and intersect_y:
-                # Keep the longer line and extend it
-                if len(line2_points) > len(line1_points):
-                    vertical_lines[j][1] = min(line[1], line2[1])
-                    vertical_lines[j][3] = max(line[3], line2[3])
-                    to_remove.append(i)
-                else:
-                    vertical_lines[i][1] = min(line[1], line2[1])
-                    vertical_lines[i][3] = max(line[3], line2[3])
-                    to_remove.append(j)
-
-    vertical_lines = [line for i, line in enumerate(vertical_lines) if i not in to_remove]
-
-    return vertical_lines
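The deleted `get_detected_lines` above keeps a Hough segment only when its angle lies within `slope_tol_deg` of ±90°. That angle test is pure math and can be sketched on its own; the `is_vertical` wrapper here is an illustration, not marker's API:

```python
import math

def line_angle_deg(x1, y1, x2, y2):
    """Angle of the segment in degrees, in (-90, 90]."""
    if x2 == x1:
        return 90.0  # vertical segment: undefined slope, treated as 90 degrees
    return math.degrees(math.atan((y2 - y1) / (x2 - x1)))

def is_vertical(x1, y1, x2, y2, slope_tol_deg=10):
    # Within slope_tol_deg of +/-90 degrees counts as a vertical table edge
    angle = line_angle_deg(x1, y1, x2, y2)
    return abs(abs(angle) - 90) < slope_tol_deg
```

A segment from (0, 0) to (2, 100) has slope 50 (about 88.9°) and passes, while a near-horizontal one from (0, 0) to (100, 3) (about 1.7°) does not.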
marker/tables/table.py
CHANGED
@@ -1,155 +1,107 @@
-from
+from tqdm import tqdm
+from pypdfium2 import PdfDocument
+from tabled.assignment import assign_rows_columns
+from tabled.formats import formatter
+from tabled.inference.detection import merge_tables
+
+from surya.input.pdflines import get_page_text_lines
+from tabled.inference.recognition import get_cells, recognize_tables
+
+from marker.pdf.images import render_image
+from marker.schema.bbox import rescale_bbox
 from marker.schema.block import Line, Span, Block
 from marker.schema.page import Page
-from tabulate import tabulate
 from typing import List
 
 from marker.settings import settings
-
-
-
-
-
-
-
-
-
-
-
-    for
-
-
-
-
-            normed_x_start = line_bbox[0] / page.width
-            normed_x_end = line_bbox[2] / page.width
-
-            cells = [[s.bbox, s.text] for s in line.spans]
-            if x_position is None or normed_x_start > x_position - space_tol:
-                # Same row
-                table_row.extend(cells)
-            else:
-                # New row
-                if len(table_row) > 0:
-                    table_rows.append(table_row)
-                table_row = cells
-            x_position = normed_x_end
-    if len(table_row) > 0:
-        table_rows.append(table_row)
-    table_rows = assign_cells_to_columns(page, table_box, table_rows)
-    return table_rows
-
-
-def get_table_pdftext(page: Page, table_box, space_tol=.01, round_factor=4) -> List[List[str]]:
-    page_width = page.width
-    table_rows = []
-    table_cell = ""
-    cell_bbox = None
-    table_row = []
-    sorted_char_blocks = sort_table_blocks(page.char_blocks)
-
-    table_width = table_box[2] - table_box[0]
-    new_line_start_x = table_box[0] + table_width * .3
-    table_width_pct = (table_width / page_width) * .95
-
-    for block_idx, block in enumerate(sorted_char_blocks):
-        sorted_lines = sort_table_blocks(block["lines"])
-        for line_idx, line in enumerate(sorted_lines):
-            line_bbox = line["bbox"]
-            intersect_pct = box_intersection_pct(line_bbox, table_box)
-            if intersect_pct < settings.TABLE_INTERSECTION_THRESH:
-                continue
-            for span in line["spans"]:
-                for char in span["chars"]:
-                    x_start, y_start, x_end, y_end = char["bbox"]
-                    x_start /= page_width
-                    x_end /= page_width
-                    fullwidth_cell = False
-
-                    if cell_bbox is not None:
-                        # Find boundaries of cell bbox before merging
-                        cell_x_start, cell_y_start, cell_x_end, cell_y_end = cell_bbox
-                        cell_x_start /= page_width
-                        cell_x_end /= page_width
-
-                        fullwidth_cell = cell_x_end - cell_x_start >= table_width_pct
-
-                    cell_content = replace_dots(replace_newlines(table_cell))
-                    if cell_bbox is None: # First char
-                        table_cell += char["char"]
-                        cell_bbox = char["bbox"]
-                    # Check if we are in the same cell, ensure cell is not full table width (like if stray text gets included in the table)
-                    elif (cell_x_start - space_tol < x_start < cell_x_end + space_tol) and not fullwidth_cell:
-                        table_cell += char["char"]
-                        cell_bbox = merge_boxes(cell_bbox, char["bbox"])
-                    # New line and cell
-                    # Use x_start < new_line_start_x to account for out-of-order cells in the pdf
-                    elif x_start < cell_x_end - space_tol and x_start < new_line_start_x:
-                        if len(table_cell) > 0:
-                            table_row.append((cell_bbox, cell_content))
-                        table_cell = char["char"]
-                        cell_bbox = char["bbox"]
-                        if len(table_row) > 0:
-                            table_row = sorted(table_row, key=lambda x: round(x[0][0] / round_factor))
-                            table_rows.append(table_row)
-                            table_row = []
-                    else: # Same line, new cell, check against cell bbox
-                        if len(table_cell) > 0:
-                            table_row.append((cell_bbox, cell_content))
-                        table_cell = char["char"]
-                        cell_bbox = char["bbox"]
-
-    if len(table_cell) > 0:
-        table_row.append((cell_bbox, replace_dots(replace_newlines(table_cell))))
-    if len(table_row) > 0:
-        table_row = sorted(table_row, key=lambda x: round(x[0][0] / round_factor))
-        table_rows.append(table_row)
-
-    total_cells = sum([len(row) for row in table_rows])
-    if total_cells > 0:
-        table_rows = assign_cells_to_columns(page, table_box, table_rows)
-        return table_rows
-    else:
-        return []
-
-
-def merge_tables(page_table_boxes):
-    # Merge tables that are next to each other
-    expansion_factor = 1.02
-    shrink_factor = .98
-    ignore_boxes = set()
-    for i in range(len(page_table_boxes)):
-        if i in ignore_boxes:
+
+
+def get_table_boxes(pages: List[Page], doc: PdfDocument, fname):
+    table_imgs = []
+    table_counts = []
+    table_bboxes = []
+    img_sizes = []
+
+    for page in pages:
+        pnum = page.pnum
+        # The bbox for the entire table
+        bbox = [b.bbox for b in page.layout.bboxes if b.label == "Table"]
+
+        if len(bbox) == 0:
+            table_counts.append(0)
+            img_sizes.append(None)
             continue
-        for j in range(i + 1, len(page_table_boxes)):
-            if j in ignore_boxes:
-                continue
-            expanded_box1 = [page_table_boxes[i][0] * shrink_factor, page_table_boxes[i][1],
-                             page_table_boxes[i][2] * expansion_factor, page_table_boxes[i][3]]
-            expanded_box2 = [page_table_boxes[j][0] * shrink_factor, page_table_boxes[j][1],
-                             page_table_boxes[j][2] * expansion_factor, page_table_boxes[j][3]]
-            if box_intersection_pct(expanded_box1, expanded_box2) > 0:
-                page_table_boxes[i] = merge_boxes(page_table_boxes[i], page_table_boxes[j])
-                ignore_boxes.add(j)
 
-
-def format_tables(pages: List[Page]):
-    # Formats tables nicely into github flavored markdown
+        highres_img = render_image(doc[pnum], dpi=settings.SURYA_TABLE_DPI)
+
+        page_table_imgs = []
+        lowres_bbox = []
+
+        # Merge tables that are next to each other
+        bbox = merge_tables(bbox)
+
+        # Number of tables per page
+        table_counts.append(len(bbox))
+        img_sizes.append(highres_img.size)
+
+        for bb in bbox:
+            highres_bb = rescale_bbox(page.layout.image_bbox, [0, 0, highres_img.size[0], highres_img.size[1]], bb)
+            page_table_imgs.append(highres_img.crop(highres_bb))
+            lowres_bbox.append(highres_bb)
+
+        table_imgs.extend(page_table_imgs)
+        table_bboxes.extend(lowres_bbox)
+
+    table_idxs = [i for i, c in enumerate(table_counts) if c > 0]
+    sel_text_lines = get_page_text_lines(
+        fname,
+        table_idxs,
+        [hr for i, hr in enumerate(img_sizes) if i in table_idxs],
+    )
+    text_lines = []
+    out_img_sizes = []
+    for i in range(len(table_counts)):
+        if i in table_idxs:
+            text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
+            out_img_sizes.extend([img_sizes[i]] * table_counts[i])
+
+    assert len(table_imgs) == len(table_bboxes) == len(text_lines) == len(out_img_sizes)
+    assert sum(table_counts) == len(table_imgs)
+
+    return table_imgs, table_bboxes, table_counts, text_lines, out_img_sizes
+
+
+def format_tables(pages: List[Page], doc: PdfDocument, fname: str, detection_model, table_rec_model, ocr_model):
+    det_models = [detection_model, detection_model.processor]
+    rec_models = [table_rec_model, table_rec_model.processor, ocr_model, ocr_model.processor]
+
+    # Don't look at table cell detection tqdm output
+    tqdm.disable = True
+    table_imgs, table_boxes, table_counts, table_text_lines, img_sizes = get_table_boxes(pages, doc, fname)
+    cells, needs_ocr = get_cells(table_imgs, table_boxes, img_sizes, table_text_lines, det_models, detect_boxes=settings.OCR_ALL_PAGES)
+    tqdm.disable = False
+
+    table_rec = recognize_tables(table_imgs, cells, needs_ocr, rec_models)
+    cells = [assign_rows_columns(tr, im_size) for tr, im_size in zip(table_rec, img_sizes)]
+    table_md = [formatter("markdown", cell)[0] for cell in cells]
+
     table_count = 0
-    for page in pages:
+    for page_idx, page in enumerate(pages):
+        page_table_count = table_counts[page_idx]
+        if page_table_count == 0:
+            continue
+
         table_insert_points = {}
         blocks_to_remove = set()
         pnum = page.pnum
-
-        page_table_boxes = [
-        page_table_boxes = [rescale_bbox(page.layout.image_bbox, page.bbox, b.bbox) for b in page_table_boxes]
-        page_table_boxes = merge_tables(page_table_boxes)
+        highres_size = img_sizes[table_count]
+        page_table_boxes = table_boxes[table_count:table_count + page_table_count]
 
         for table_idx, table_box in enumerate(page_table_boxes):
+            lowres_table_box = rescale_bbox([0, 0, highres_size[0], highres_size[1]], page.bbox, table_box)
+
             for block_idx, block in enumerate(page.blocks):
-                intersect_pct = block.intersection_pct(
+                intersect_pct = block.intersection_pct(lowres_table_box)
                 if intersect_pct > settings.TABLE_INTERSECTION_THRESH and block.block_type == "Table":
                     if table_idx not in table_insert_points:
                         table_insert_points[table_idx] = max(0, block_idx - len(blocks_to_remove)) # Where to insert the new table
@@ -163,17 +115,10 @@ def format_tables(pages: List[Page]):
 
         for table_idx, table_box in enumerate(page_table_boxes):
             if table_idx not in table_insert_points:
+                table_count += 1
                 continue
 
-            if
-                table_rows = get_table_surya(page, table_box)
-            else:
-                table_rows = get_table_pdftext(page, table_box)
-            # Skip empty tables
-            if len(table_rows) == 0:
-                continue
-
-            table_text = tabulate(table_rows, headers="firstrow", tablefmt="github", disable_numparse=True)
+            markdown = table_md[table_count]
             table_block = Block(
                 bbox=table_box,
                 block_type="Table",
@@ -187,7 +132,7 @@ def format_tables(pages: List[Page]):
                     font_size=0,
                     font_weight=0,
                     block_type="Table",
-                    text=
+                    text=markdown
                 )]
            )]
        )
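In the new table.py above, `get_table_boxes` returns flat per-table lists plus a per-page `table_counts` list, and `format_tables` walks the flat lists with a running `table_count` cursor. A simplified, dependency-free sketch of that bookkeeping (variable names borrowed from the diff; the data is made up, and the per-page slicing is condensed relative to the real loop):

```python
table_counts = [0, 2, 1]            # tables detected on each page
table_md = ["|a|", "|b|", "|c|"]    # one markdown string per table, flattened

table_count = 0                     # running cursor into the flat lists
per_page = []
for page_idx, page_table_count in enumerate(table_counts):
    if page_table_count == 0:
        per_page.append([])         # no tables on this page, cursor stays put
        continue
    # Slice this page's tables out of the flat list, then advance the cursor
    per_page.append(table_md[table_count:table_count + page_table_count])
    table_count += page_table_count
```

Keeping the cursor in sync is what the `table_count += 1` under the `if table_idx not in table_insert_points` branch guards: even skipped tables must advance it, or later pages would read the wrong markdown.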
poetry.lock
CHANGED
|
@@ -1143,25 +1185,6 @@ files = [
     {file = "idna-3.7.tar.gz", hash = "sha256:028ff3aadf0609c1fd278d8ea3089299412a7a8b9bd005dd08b9f8285bcb5cfc"},
 ]

-[[package]]
-name = "importlib-metadata"
-version = "8.0.0"
-description = "Read metadata from Python packages"
-optional = false
-python-versions = ">=3.8"
-files = [
-    {file = "importlib_metadata-8.0.0-py3-none-any.whl", hash = "sha256:15584cf2b1bf449d98ff8a6ff1abef57bf20f3ac6454f431736cd3e660921b2f"},
-    {file = "importlib_metadata-8.0.0.tar.gz", hash = "sha256:188bd24e4c346d3f0a933f275c2fec67050326a856b9a359881d7c2a697e8812"},
-]
-
-[package.dependencies]
-zipp = ">=0.5"
-
-[package.extras]
-doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
-perf = ["ipython"]
-test = ["flufl.flake8", "importlib-resources (>=1.3)", "jaraco.test (>=5.4)", "packaging", "pyfakefs", "pytest (>=6,!=8.1.*)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)", "pytest-mypy", "pytest-perf (>=0.9.2)", "pytest-ruff (>=0.2.1)"]
-
 [[package]]
 name = "intel-openmp"
 version = "2021.4.0"
@@ -1231,7 +1254,6 @@ prompt-toolkit = ">=3.0.41,<3.1.0"
 pygments = ">=2.4.0"
 stack-data = "*"
 traitlets = ">=5"
-typing-extensions = {version = "*", markers = "python_version < \"3.10\""}

 [package.extras]
 all = ["black", "curio", "docrepr", "exceptiongroup", "ipykernel", "ipyparallel", "ipywidgets", "matplotlib", "matplotlib (!=3.2.0)", "nbconvert", "nbformat", "notebook", "numpy (>=1.22)", "pandas", "pickleshare", "pytest (<7)", "pytest (<7.1)", "pytest-asyncio (<0.22)", "qtconsole", "setuptools (>=18.5)", "sphinx (>=1.3)", "sphinx-rtd-theme", "stack-data", "testpath", "trio", "typing-extensions"]
@@ -1425,7 +1447,6 @@ files = [
 ]

 [package.dependencies]
-importlib-metadata = {version = ">=4.8.3", markers = "python_version < \"3.10\""}
 jupyter-core = ">=4.12,<5.0.dev0 || >=5.1.dev0"
 python-dateutil = ">=2.8.2"
 pyzmq = ">=23.0"
@@ -1517,7 +1538,6 @@ files = [
 ]

 [package.dependencies]
-importlib-metadata = {version = ">=4.8.3", markers = "python_version < \"3.10\""}
 jupyter-server = ">=1.1.2"

 [[package]]
@@ -1589,7 +1609,6 @@ files = [
 [package.dependencies]
 async-lru = ">=1.0.0"
 httpx = ">=0.25.0"
-importlib-metadata = {version = ">=4.8.3", markers = "python_version < \"3.10\""}
 ipykernel = ">=6.5.0"
 jinja2 = ">=3.0.3"
 jupyter-core = "*"
@@ -1634,7 +1653,6 @@ files = [

 [package.dependencies]
 babel = ">=2.10"
-importlib-metadata = {version = ">=4.8.3", markers = "python_version < \"3.10\""}
 jinja2 = ">=3.0.3"
 json5 = ">=0.9.0"
 jsonschema = ">=4.18.0"
@@ -1999,7 +2017,6 @@ files = [
 beautifulsoup4 = "*"
 bleach = "!=5.0.0"
 defusedxml = "*"
-importlib-metadata = {version = ">=3.6", markers = "python_version < \"3.10\""}
 jinja2 = ">=3.0"
 jupyter-core = ">=4.7"
 jupyterlab-pygments = "*"
@@ -2317,12 +2377,10 @@ files = [

 [package.dependencies]
 numpy = [
-    {version = ">=1.
     {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
     {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
-    {version = ">=1.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"aarch64\" and python_version >= \"3.8\" and python_version < \"3.10\" or python_version > \"3.9\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_system != \"Darwin\" and python_version < \"3.10\" or python_version >= \"3.9\" and platform_machine != \"arm64\" and python_version < \"3.10\""},
     {version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
-    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]

 [[package]]
@@ -2387,9 +2445,9 @@ files = [

 [package.dependencies]
 numpy = [
     {version = ">=1.22.4", markers = "python_version < \"3.11\""},
     {version = ">=1.23.2", markers = "python_version == \"3.11\""},
-    {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
 python-dateutil = ">=2.8.2"
 pytz = ">=2020.1"
@@ -2448,20 +2506,20 @@ testing = ["docopt", "pytest"]

 [[package]]
 name = "pdftext"
-version = "0.3.
 description = "Extract structured text from pdfs quickly"
 optional = false
 python-versions = "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9"
 files = [
-    {file = "pdftext-0.3.
-    {file = "pdftext-0.3.
 ]

 [package.dependencies]
 pydantic = ">=2.7.1,<3.0.0"
 pydantic-settings = ">=2.2.1,<3.0.0"
 pypdfium2 = ">=4.29.0,<5.0.0"
-scikit-learn = ">=1.4.2,<2.0.0"

 [[package]]
 name = "pexpect"
@@ -2756,119 +2814,123 @@ files = [

 [[package]]
 name = "pydantic"
-version = "2.
 description = "Data validation using Python type hints"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "pydantic-2.
-    {file = "pydantic-2.
 ]

 [package.dependencies]
-annotated-types = ">=0.
-pydantic-core = "2.
-typing-extensions =

 [package.extras]
 email = ["email-validator (>=2.0.0)"]

 [[package]]
 name = "pydantic-core"
-version = "2.
 description = "Core functionality for Pydantic validation and serialization"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "pydantic_core-2.
 ]

 [package.dependencies]
@@ -2876,13 +2938,13 @@ typing-extensions = ">=4.6.0,<4.7.0 || >4.7.0"

 [[package]]
 name = "pydantic-settings"
-version = "2.
 description = "Settings management using Pydantic"
 optional = false
 python-versions = ">=3.8"
 files = [
-    {file = "pydantic_settings-2.
-    {file = "pydantic_settings-2.
 ]

 [package.dependencies]
@@ -2949,6 +3011,20 @@ files = [
     {file = "pypdfium2-4.30.0.tar.gz", hash = "sha256:48b5b7e5566665bc1015b9d69c1ebabe21f6aee468b509531c3c8318eeee2e16"},
 ]

 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"
@@ -3743,87 +3819,103 @@ torch = ["safetensors[numpy]", "torch (>=1.10)"]

 [[package]]
 name = "scikit-learn"
-version = "1.
 description = "A set of python modules for machine learning and data mining"
 optional = false
 python-versions = ">=3.9"
 files = [
-    {file = "
-    {file = "scikit_learn-1.
 ]

 [package.dependencies]
 joblib = ">=1.2.0"
 numpy = ">=1.19.5"
 scipy = ">=1.6.0"
-threadpoolctl = ">=

 [package.extras]
-benchmark = ["matplotlib (>=3.3.4)", "
 examples = ["matplotlib (>=3.3.4)", "pandas (>=1.1.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.17.2)", "seaborn (>=0.9.0)"]

 [[package]]
 name = "scipy"
-version = "1.
 description = "Fundamental algorithms for scientific computing in Python"
 optional = false
-python-versions = ">=3.
-files = [
-    {file = "scipy-1.
 ]

 [package.dependencies]
-numpy = ">=1.

 [package.extras]
-dev = ["cython-lint (>=0.12.2)", "doit (>=0.36.0)", "mypy", "pycodestyle", "pydevtool", "rich-click", "ruff", "types-psutil", "typing_extensions"]
-doc = ["jupyterlite-pyodide-kernel", "jupyterlite-sphinx (>=0.
-test = ["array-api-strict", "asv", "gmpy2", "hypothesis (>=6.30)", "mpmath", "pooch", "pytest", "pytest-cov", "pytest-timeout", "pytest-xdist", "scikit-umfpack", "threadpoolctl"]

 [[package]]
 name = "send2trash"
@@ -3956,19 +4048,20 @@ snowflake = ["snowflake-connector-python (>=2.8.0)", "snowflake-snowpark-python

 [[package]]
 name = "surya-ocr"
-version = "0.
-description = "OCR, layout, reading order, and
 optional = false
-python-versions = "
 files = [
-    {file = "surya_ocr-0.
-    {file = "surya_ocr-0.
 ]

 [package.dependencies]
 filetype = ">=1.2.0,<2.0.0"
 ftfy = ">=6.1.3,<7.0.0"
 opencv-python = ">=4.9.0.80,<5.0.0.0"
 pillow = ">=10.2.0,<11.0.0"
 pydantic = ">=2.5.3,<3.0.0"
 pydantic-settings = ">=2.1.0,<3.0.0"
@@ -3995,6 +4088,27 @@ mpmath = ">=1.1.0,<1.4"
 [package.extras]
 dev = ["hypothesis (>=6.70.0)", "pytest (>=7.1.0)"]

 [[package]]
 name = "tabulate"
 version = "0.9.0"
@@ -4858,22 +4972,7 @@ files = [
 idna = ">=2.0"
 multidict = ">=4.0"

-[[package]]
-name = "zipp"
-version = "3.19.2"
-description = "Backport of pathlib-compatible object wrapper for zip files"
-optional = false
-python-versions = ">=3.8"
-files = [
-    {file = "zipp-3.19.2-py3-none-any.whl", hash = "sha256:f091755f667055f2d02b32c53771a7a6c8b47e1fdbc4b72a8b9072b3eef8015c"},
-    {file = "zipp-3.19.2.tar.gz", hash = "sha256:bf1dcf6450f873a13e952a29504887c89e6de7506209e5b1bcc3460135d4de19"},
-]
-
-[package.extras]
-doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
-test = ["big-O", "importlib-resources", "jaraco.functools", "jaraco.itertools", "jaraco.test", "more-itertools", "pytest (>=6,!=8.1.*)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)", "pytest-ignore-flaky", "pytest-mypy", "pytest-ruff (>=0.2.1)"]
-
 [metadata]
 lock-version = "2.0"
-python-versions = "
-content-hash = "
| 601 |
{file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"},
|
| 602 |
]
|
| 603 |
|
| 604 |
+
[[package]]
|
| 605 |
+
name = "coloredlogs"
|
| 606 |
+
version = "15.0.1"
|
| 607 |
+
description = "Colored terminal output for Python's logging module"
|
| 608 |
+
optional = false
|
| 609 |
+
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
|
| 610 |
+
files = [
|
| 611 |
+
{file = "coloredlogs-15.0.1-py2.py3-none-any.whl", hash = "sha256:612ee75c546f53e92e70049c9dbfcc18c935a2b9a53b66085ce9ef6a6e5c0934"},
|
| 612 |
+
{file = "coloredlogs-15.0.1.tar.gz", hash = "sha256:7c991aa71a4577af2f82600d8f8f3a89f936baeaf9b50a9c197da014e5bf16b0"},
|
| 613 |
+
]
|
| 614 |
+
|
| 615 |
+
[package.dependencies]
|
| 616 |
+
humanfriendly = ">=9.1"
|
| 617 |
+
|
| 618 |
+
[package.extras]
|
| 619 |
+
cron = ["capturer (>=2.4)"]
|
| 620 |
+
|
| 621 |
[[package]]
|
| 622 |
name = "comm"
|
| 623 |
version = "0.2.2"
|
|
|
|
| 816 |
{file = "filetype-1.2.0.tar.gz", hash = "sha256:66b56cd6474bf41d8c54660347d37afcc3f7d1970648de365c102ef77548aadb"},
|
| 817 |
]
|
| 818 |
|
| 819 |
+
[[package]]
|
| 820 |
+
name = "flatbuffers"
|
| 821 |
+
version = "24.3.25"
|
| 822 |
+
description = "The FlatBuffers serialization format for Python"
|
| 823 |
+
optional = false
|
| 824 |
+
python-versions = "*"
|
| 825 |
+
files = [
|
| 826 |
+
{file = "flatbuffers-24.3.25-py2.py3-none-any.whl", hash = "sha256:8dbdec58f935f3765e4f7f3cf635ac3a77f83568138d6a2311f524ec96364812"},
|
| 827 |
+
{file = "flatbuffers-24.3.25.tar.gz", hash = "sha256:de2ec5b203f21441716617f38443e0a8ebf3d25bf0d9c0bb0ce68fa00ad546a4"},
|
| 828 |
+
]
|
| 829 |
+
|
| 830 |
[[package]]
|
| 831 |
name = "fqdn"
|
| 832 |
version = "1.5.1"
|
|
|
|
| 1160 |
torch = ["safetensors", "torch"]
|
| 1161 |
typing = ["types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)"]
|
| 1162 |
|
| 1163 |
+
[[package]]
|
| 1164 |
+
name = "humanfriendly"
|
| 1165 |
+
version = "10.0"
|
| 1166 |
+
description = "Human friendly output for text interfaces using Python"
|
| 1167 |
+
optional = false
|
| 1168 |
+
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
|
| 1169 |
+
files = [
|
| 1170 |
+
{file = "humanfriendly-10.0-py2.py3-none-any.whl", hash = "sha256:1697e1a8a8f550fd43c2865cd84542fc175a61dcb779b6fee18cf6b6ccba1477"},
|
| 1171 |
+
{file = "humanfriendly-10.0.tar.gz", hash = "sha256:6b0b831ce8f15f7300721aa49829fc4e83921a9a301cc7f606be6686a2288ddc"},
|
| 1172 |
+
]
|
| 1173 |
+
|
| 1174 |
+
[package.dependencies]
|
| 1175 |
+
pyreadline3 = {version = "*", markers = "sys_platform == \"win32\" and python_version >= \"3.8\""}
|
| 1176 |
+
|
| 1177 |
[[package]]
|
| 1178 |
name = "idna"
|
| 1179 |
version = "3.7"
|
|
|
|
| 1185 |
{file = "idna-3.7.tar.gz", hash = "sha256:028ff3aadf0609c1fd278d8ea3089299412a7a8b9bd005dd08b9f8285bcb5cfc"},
|
| 1186 |
]
|
| 1187 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1188 |
[[package]]
|
| 1189 |
name = "intel-openmp"
|
| 1190 |
version = "2021.4.0"
|
|
|
|
| 1254 |
pygments = ">=2.4.0"
|
| 1255 |
stack-data = "*"
|
| 1256 |
traitlets = ">=5"
|
|
|
|
| 1257 |
|
| 1258 |
[package.extras]
|
| 1259 |
all = ["black", "curio", "docrepr", "exceptiongroup", "ipykernel", "ipyparallel", "ipywidgets", "matplotlib", "matplotlib (!=3.2.0)", "nbconvert", "nbformat", "notebook", "numpy (>=1.22)", "pandas", "pickleshare", "pytest (<7)", "pytest (<7.1)", "pytest-asyncio (<0.22)", "qtconsole", "setuptools (>=18.5)", "sphinx (>=1.3)", "sphinx-rtd-theme", "stack-data", "testpath", "trio", "typing-extensions"]
|
|
|
|
| 1447 |
]
|
| 1448 |
|
| 1449 |
[package.dependencies]
|
|
|
|
| 1450 |
jupyter-core = ">=4.12,<5.0.dev0 || >=5.1.dev0"
|
| 1451 |
python-dateutil = ">=2.8.2"
|
| 1452 |
pyzmq = ">=23.0"
|
|
|
|
| 1538 |
]
|
| 1539 |
|
| 1540 |
[package.dependencies]
|
|
|
|
| 1541 |
jupyter-server = ">=1.1.2"
|
| 1542 |
|
| 1543 |
[[package]]
|
|
|
|
| 1609 |
[package.dependencies]
|
| 1610 |
async-lru = ">=1.0.0"
|
| 1611 |
httpx = ">=0.25.0"
|
|
|
|
| 1612 |
ipykernel = ">=6.5.0"
|
| 1613 |
jinja2 = ">=3.0.3"
|
| 1614 |
jupyter-core = "*"
|
|
|
|
| 1653 |
|
| 1654 |
[package.dependencies]
|
| 1655 |
babel = ">=2.10"
|
|
|
|
| 1656 |
jinja2 = ">=3.0.3"
|
| 1657 |
json5 = ">=0.9.0"
|
| 1658 |
jsonschema = ">=4.18.0"
|
|
|
|
| 2017 |
beautifulsoup4 = "*"
|
| 2018 |
bleach = "!=5.0.0"
|
| 2019 |
defusedxml = "*"
|
|
|
|
| 2020 |
jinja2 = ">=3.0"
|
| 2021 |
jupyter-core = ">=4.7"
|
| 2022 |
jupyterlab-pygments = "*"
|
|
|
|
| 2301 |
optional = false
|
| 2302 |
python-versions = ">=3"
|
| 2303 |
files = [
|
| 2304 |
+
{file = "nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_aarch64.whl", hash = "sha256:98103729cc5226e13ca319a10bbf9433bbbd44ef64fe72f45f067cacc14b8d27"},
|
| 2305 |
{file = "nvidia_nvjitlink_cu12-12.5.82-py3-none-manylinux2014_x86_64.whl", hash = "sha256:f9b37bc5c8cf7509665cb6ada5aaa0ce65618f2332b7d3e78e9790511f111212"},
|
| 2306 |
{file = "nvidia_nvjitlink_cu12-12.5.82-py3-none-win_amd64.whl", hash = "sha256:e782564d705ff0bf61ac3e1bf730166da66dd2fe9012f111ede5fc49b64ae697"},
|
| 2307 |
]
|
|
|
|
| 2317 |
{file = "nvidia_nvtx_cu12-12.1.105-py3-none-win_amd64.whl", hash = "sha256:65f4d98982b31b60026e0e6de73fbdfc09d08a96f4656dd3665ca616a11e1e82"},
|
| 2318 |
]
|
| 2319 |
|
| 2320 |
+
[[package]]
|
| 2321 |
+
name = "onnxruntime"
|
| 2322 |
+
version = "1.19.2"
|
| 2323 |
+
description = "ONNX Runtime is a runtime accelerator for Machine Learning models"
|
| 2324 |
+
optional = false
|
| 2325 |
+
python-versions = "*"
|
| 2326 |
+
files = [
|
| 2327 |
+
{file = "onnxruntime-1.19.2-cp310-cp310-macosx_11_0_universal2.whl", hash = "sha256:84fa57369c06cadd3c2a538ae2a26d76d583e7c34bdecd5769d71ca5c0fc750e"},
|
| 2328 |
+
{file = "onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:bdc471a66df0c1cdef774accef69e9f2ca168c851ab5e4f2f3341512c7ef4666"},
|
| 2329 |
+
{file = "onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e3a4ce906105d99ebbe817f536d50a91ed8a4d1592553f49b3c23c4be2560ae6"},
|
| 2330 |
+
{file = "onnxruntime-1.19.2-cp310-cp310-win32.whl", hash = "sha256:4b3d723cc154c8ddeb9f6d0a8c0d6243774c6b5930847cc83170bfe4678fafb3"},
|
| 2331 |
+
{file = "onnxruntime-1.19.2-cp310-cp310-win_amd64.whl", hash = "sha256:17ed7382d2c58d4b7354fb2b301ff30b9bf308a1c7eac9546449cd122d21cae5"},
|
| 2332 |
+
{file = "onnxruntime-1.19.2-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:d863e8acdc7232d705d49e41087e10b274c42f09e259016a46f32c34e06dc4fd"},
|
| 2333 |
+
{file = "onnxruntime-1.19.2-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c1dfe4f660a71b31caa81fc298a25f9612815215a47b286236e61d540350d7b6"},
|
| 2334 |
+
{file = "onnxruntime-1.19.2-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a36511dc07c5c964b916697e42e366fa43c48cdb3d3503578d78cef30417cb84"},
|
| 2335 |
+
{file = "onnxruntime-1.19.2-cp311-cp311-win32.whl", hash = "sha256:50cbb8dc69d6befad4746a69760e5b00cc3ff0a59c6c3fb27f8afa20e2cab7e7"},
|
| 2336 |
+
{file = "onnxruntime-1.19.2-cp311-cp311-win_amd64.whl", hash = "sha256:1c3e5d415b78337fa0b1b75291e9ea9fb2a4c1f148eb5811e7212fed02cfffa8"},
|
| 2337 |
+
{file = "onnxruntime-1.19.2-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:68e7051bef9cfefcbb858d2d2646536829894d72a4130c24019219442b1dd2ed"},
|
| 2338 |
+
{file = "onnxruntime-1.19.2-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d2d366fbcc205ce68a8a3bde2185fd15c604d9645888703785b61ef174265168"},
|
| 2339 |
+
{file = "onnxruntime-1.19.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:477b93df4db467e9cbf34051662a4b27c18e131fa1836e05974eae0d6e4cf29b"},
|
| 2340 |
+
{file = "onnxruntime-1.19.2-cp312-cp312-win32.whl", hash = "sha256:9a174073dc5608fad05f7cf7f320b52e8035e73d80b0a23c80f840e5a97c0147"},
|
| 2341 |
+
{file = "onnxruntime-1.19.2-cp312-cp312-win_amd64.whl", hash = "sha256:190103273ea4507638ffc31d66a980594b237874b65379e273125150eb044857"},
|
| 2342 |
+
{file = "onnxruntime-1.19.2-cp38-cp38-macosx_11_0_universal2.whl", hash = "sha256:636bc1d4cc051d40bc52e1f9da87fbb9c57d9d47164695dfb1c41646ea51ea66"},
|
| 2343 |
+
{file = "onnxruntime-1.19.2-cp38-cp38-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5bd8b875757ea941cbcfe01582970cc299893d1b65bd56731e326a8333f638a3"},
|
| 2344 |
+
{file = "onnxruntime-1.19.2-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b2046fc9560f97947bbc1acbe4c6d48585ef0f12742744307d3364b131ac5778"},
|
| 2345 |
+
{file = "onnxruntime-1.19.2-cp38-cp38-win32.whl", hash = "sha256:31c12840b1cde4ac1f7d27d540c44e13e34f2345cf3642762d2a3333621abb6a"},
|
| 2346 |
+
{file = "onnxruntime-1.19.2-cp38-cp38-win_amd64.whl", hash = "sha256:016229660adea180e9a32ce218b95f8f84860a200f0f13b50070d7d90e92956c"},
|
| 2347 |
+
{file = "onnxruntime-1.19.2-cp39-cp39-macosx_11_0_universal2.whl", hash = "sha256:006c8d326835c017a9e9f74c9c77ebb570a71174a1e89fe078b29a557d9c3848"},
|
| 2348 |
+
{file = "onnxruntime-1.19.2-cp39-cp39-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:df2a94179a42d530b936f154615b54748239c2908ee44f0d722cb4df10670f68"},
|
| 2349 |
+
{file = "onnxruntime-1.19.2-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fae4b4de45894b9ce7ae418c5484cbf0341db6813effec01bb2216091c52f7fb"},
|
| 2350 |
+
{file = "onnxruntime-1.19.2-cp39-cp39-win32.whl", hash = "sha256:dc5430f473e8706fff837ae01323be9dcfddd3ea471c900a91fa7c9b807ec5d3"},
|
| 2351 |
+
{file = "onnxruntime-1.19.2-cp39-cp39-win_amd64.whl", hash = "sha256:38475e29a95c5f6c62c2c603d69fc7d4c6ccbf4df602bd567b86ae1138881c49"},
|
| 2352 |
+
]
|
| 2353 |
+
|
| 2354 |
+
[package.dependencies]
|
| 2355 |
+
coloredlogs = "*"
|
| 2356 |
+
flatbuffers = "*"
|
| 2357 |
+
numpy = ">=1.21.6"
|
| 2358 |
+
packaging = "*"
|
| 2359 |
+
protobuf = "*"
|
| 2360 |
+
sympy = "*"
|
| 2361 |
+
|
| 2362 |
[[package]]
|
| 2363 |
name = "opencv-python"
|
| 2364 |
version = "4.10.0.84"
|
|
|
|
| 2377 |
|
| 2378 |
[package.dependencies]
|
| 2379 |
numpy = [
|
| 2380 |
+
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
| 2381 |
{version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\" and python_version < \"3.11\""},
|
| 2382 |
{version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\" and python_version < \"3.11\""},
|
|
|
|
| 2383 |
{version = ">=1.23.5", markers = "python_version >= \"3.11\" and python_version < \"3.12\""},
|
|
|
|
| 2384 |
]
|
| 2385 |
|
| 2386 |
[[package]]
|
|
|
|
| 2445 |
|
| 2446 |
[package.dependencies]
|
| 2447 |
numpy = [
|
| 2448 |
+
{version = ">=1.26.0", markers = "python_version >= \"3.12\""},
|
| 2449 |
{version = ">=1.22.4", markers = "python_version < \"3.11\""},
|
| 2450 |
{version = ">=1.23.2", markers = "python_version == \"3.11\""},
|
|
|
|
| 2451 |
]
|
| 2452 |
python-dateutil = ">=2.8.2"
|
| 2453 |
pytz = ">=2020.1"
|
|
|
|
| 2506 |
|
| 2507 |
[[package]]
|
| 2508 |
name = "pdftext"
|
| 2509 |
+
version = "0.3.13"
|
| 2510 |
description = "Extract structured text from pdfs quickly"
|
| 2511 |
optional = false
|
| 2512 |
python-versions = "!=2.7.*,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,!=3.7.*,!=3.8.*,>=3.9"
|
| 2513 |
files = [
|
| 2514 |
+
{file = "pdftext-0.3.13-py3-none-any.whl", hash = "sha256:ae8f6876cdbbc1fe611527bb362cd3d584b4c8ec9370215560f2a01be4343bbc"},
|
| 2515 |
+
{file = "pdftext-0.3.13.tar.gz", hash = "sha256:a37ceb759ac0da34c48f85ab5d43d0b128ad9526f949e98b96568495c7be4187"},
|
| 2516 |
]
|
| 2517 |
|
| 2518 |
[package.dependencies]
|
| 2519 |
+
onnxruntime = ">=1.19.2,<2.0.0"
|
| 2520 |
pydantic = ">=2.7.1,<3.0.0"
|
| 2521 |
pydantic-settings = ">=2.2.1,<3.0.0"
|
| 2522 |
pypdfium2 = ">=4.29.0,<5.0.0"
|
|
|
|
| 2523 |
|
| 2524 |
[[package]]
|
| 2525 |
name = "pexpect"
|
|
|
|
...

[[package]]
name = "pydantic"
+version = "2.9.2"
description = "Data validation using Python type hints"
optional = false
python-versions = ">=3.8"
files = [
+    {file = "pydantic-2.9.2-py3-none-any.whl", hash = "sha256:f048cec7b26778210e28a0459867920654d48e5e62db0958433636cde4254f12"},
+    {file = "pydantic-2.9.2.tar.gz", hash = "sha256:d155cef71265d1e9807ed1c32b4c8deec042a44a50a4188b25ac67ecd81a9c0f"},
]

[package.dependencies]
+annotated-types = ">=0.6.0"
+pydantic-core = "2.23.4"
+typing-extensions = [
+    {version = ">=4.12.2", markers = "python_version >= \"3.13\""},
+    {version = ">=4.6.1", markers = "python_version < \"3.13\""},
+]

[package.extras]
email = ["email-validator (>=2.0.0)"]
+timezone = ["tzdata"]

[[package]]
name = "pydantic-core"
+version = "2.23.4"
description = "Core functionality for Pydantic validation and serialization"
optional = false
python-versions = ">=3.8"
files = [
+    {file = "pydantic_core-2.23.4-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:b10bd51f823d891193d4717448fab065733958bdb6a6b351967bd349d48d5c9b"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:4fc714bdbfb534f94034efaa6eadd74e5b93c8fa6315565a222f7b6f42ca1166"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:63e46b3169866bd62849936de036f901a9356e36376079b05efa83caeaa02ceb"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ed1a53de42fbe34853ba90513cea21673481cd81ed1be739f7f2efb931b24916"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:cfdd16ab5e59fc31b5e906d1a3f666571abc367598e3e02c83403acabc092e07"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:255a8ef062cbf6674450e668482456abac99a5583bbafb73f9ad469540a3a232"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4a7cd62e831afe623fbb7aabbb4fe583212115b3ef38a9f6b71869ba644624a2"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:f09e2ff1f17c2b51f2bc76d1cc33da96298f0a036a137f5440ab3ec5360b624f"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:e38e63e6f3d1cec5a27e0afe90a085af8b6806ee208b33030e65b6516353f1a3"},
+    {file = "pydantic_core-2.23.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:0dbd8dbed2085ed23b5c04afa29d8fd2771674223135dc9bc937f3c09284d071"},
+    {file = "pydantic_core-2.23.4-cp310-none-win32.whl", hash = "sha256:6531b7ca5f951d663c339002e91aaebda765ec7d61b7d1e3991051906ddde119"},
+    {file = "pydantic_core-2.23.4-cp310-none-win_amd64.whl", hash = "sha256:7c9129eb40958b3d4500fa2467e6a83356b3b61bfff1b414c7361d9220f9ae8f"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:77733e3892bb0a7fa797826361ce8a9184d25c8dffaec60b7ffe928153680ba8"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:1b84d168f6c48fabd1f2027a3d1bdfe62f92cade1fb273a5d68e621da0e44e6d"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:df49e7a0861a8c36d089c1ed57d308623d60416dab2647a4a17fe050ba85de0e"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ff02b6d461a6de369f07ec15e465a88895f3223eb75073ffea56b84d9331f607"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:996a38a83508c54c78a5f41456b0103c30508fed9abcad0a59b876d7398f25fd"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d97683ddee4723ae8c95d1eddac7c192e8c552da0c73a925a89fa8649bf13eea"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:216f9b2d7713eb98cb83c80b9c794de1f6b7e3145eef40400c62e86cee5f4e1e"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:6f783e0ec4803c787bcea93e13e9932edab72068f68ecffdf86a99fd5918878b"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:d0776dea117cf5272382634bd2a5c1b6eb16767c223c6a5317cd3e2a757c61a0"},
+    {file = "pydantic_core-2.23.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:d5f7a395a8cf1621939692dba2a6b6a830efa6b3cee787d82c7de1ad2930de64"},
+    {file = "pydantic_core-2.23.4-cp311-none-win32.whl", hash = "sha256:74b9127ffea03643e998e0c5ad9bd3811d3dac8c676e47db17b0ee7c3c3bf35f"},
+    {file = "pydantic_core-2.23.4-cp311-none-win_amd64.whl", hash = "sha256:98d134c954828488b153d88ba1f34e14259284f256180ce659e8d83e9c05eaa3"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:f3e0da4ebaef65158d4dfd7d3678aad692f7666877df0002b8a522cdf088f231"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:f69a8e0b033b747bb3e36a44e7732f0c99f7edd5cea723d45bc0d6e95377ffee"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:723314c1d51722ab28bfcd5240d858512ffd3116449c557a1336cbe3919beb87"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:bb2802e667b7051a1bebbfe93684841cc9351004e2badbd6411bf357ab8d5ac8"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d18ca8148bebe1b0a382a27a8ee60350091a6ddaf475fa05ef50dc35b5df6327"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:33e3d65a85a2a4a0dc3b092b938a4062b1a05f3a9abde65ea93b233bca0e03f2"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:128585782e5bfa515c590ccee4b727fb76925dd04a98864182b22e89a4e6ed36"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:68665f4c17edcceecc112dfed5dbe6f92261fb9d6054b47d01bf6371a6196126"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:20152074317d9bed6b7a95ade3b7d6054845d70584216160860425f4fbd5ee9e"},
+    {file = "pydantic_core-2.23.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:9261d3ce84fa1d38ed649c3638feefeae23d32ba9182963e465d58d62203bd24"},
+    {file = "pydantic_core-2.23.4-cp312-none-win32.whl", hash = "sha256:4ba762ed58e8d68657fc1281e9bb72e1c3e79cc5d464be146e260c541ec12d84"},
+    {file = "pydantic_core-2.23.4-cp312-none-win_amd64.whl", hash = "sha256:97df63000f4fea395b2824da80e169731088656d1818a11b95f3b173747b6cd9"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:7530e201d10d7d14abce4fb54cfe5b94a0aefc87da539d0346a484ead376c3cc"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:df933278128ea1cd77772673c73954e53a1c95a4fdf41eef97c2b779271bd0bd"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0cb3da3fd1b6a5d0279a01877713dbda118a2a4fc6f0d821a57da2e464793f05"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:42c6dcb030aefb668a2b7009c85b27f90e51e6a3b4d5c9bc4c57631292015b0d"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:696dd8d674d6ce621ab9d45b205df149399e4bb9aa34102c970b721554828510"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2971bb5ffe72cc0f555c13e19b23c85b654dd2a8f7ab493c262071377bfce9f6"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8394d940e5d400d04cad4f75c0598665cbb81aecefaca82ca85bd28264af7f9b"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:0dff76e0602ca7d4cdaacc1ac4c005e0ce0dcfe095d5b5259163a80d3a10d327"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:7d32706badfe136888bdea71c0def994644e09fff0bfe47441deaed8e96fdbc6"},
+    {file = "pydantic_core-2.23.4-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ed541d70698978a20eb63d8c5d72f2cc6d7079d9d90f6b50bad07826f1320f5f"},
+    {file = "pydantic_core-2.23.4-cp313-none-win32.whl", hash = "sha256:3d5639516376dce1940ea36edf408c554475369f5da2abd45d44621cb616f769"},
+    {file = "pydantic_core-2.23.4-cp313-none-win_amd64.whl", hash = "sha256:5a1504ad17ba4210df3a045132a7baeeba5a200e930f57512ee02909fc5c4cb5"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:d4488a93b071c04dc20f5cecc3631fc78b9789dd72483ba15d423b5b3689b555"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:81965a16b675b35e1d09dd14df53f190f9129c0202356ed44ab2728b1c905658"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4ffa2ebd4c8530079140dd2d7f794a9d9a73cbb8e9d59ffe24c63436efa8f271"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:61817945f2fe7d166e75fbfb28004034b48e44878177fc54d81688e7b85a3665"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:29d2c342c4bc01b88402d60189f3df065fb0dda3654744d5a165a5288a657368"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5e11661ce0fd30a6790e8bcdf263b9ec5988e95e63cf901972107efc49218b13"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9d18368b137c6295db49ce7218b1a9ba15c5bc254c96d7c9f9e924a9bc7825ad"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ec4e55f79b1c4ffb2eecd8a0cfba9955a2588497d96851f4c8f99aa4a1d39b12"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:374a5e5049eda9e0a44c696c7ade3ff355f06b1fe0bb945ea3cac2bc336478a2"},
+    {file = "pydantic_core-2.23.4-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:5c364564d17da23db1106787675fc7af45f2f7b58b4173bfdd105564e132e6fb"},
+    {file = "pydantic_core-2.23.4-cp38-none-win32.whl", hash = "sha256:d7a80d21d613eec45e3d41eb22f8f94ddc758a6c4720842dc74c0581f54993d6"},
+    {file = "pydantic_core-2.23.4-cp38-none-win_amd64.whl", hash = "sha256:5f5ff8d839f4566a474a969508fe1c5e59c31c80d9e140566f9a37bba7b8d556"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:a4fa4fc04dff799089689f4fd502ce7d59de529fc2f40a2c8836886c03e0175a"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:0a7df63886be5e270da67e0966cf4afbae86069501d35c8c1b3b6c168f42cb36"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:dcedcd19a557e182628afa1d553c3895a9f825b936415d0dbd3cd0bbcfd29b4b"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5f54b118ce5de9ac21c363d9b3caa6c800341e8c47a508787e5868c6b79c9323"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:86d2f57d3e1379a9525c5ab067b27dbb8a0642fb5d454e17a9ac434f9ce523e3"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:de6d1d1b9e5101508cb37ab0d972357cac5235f5c6533d1071964c47139257df"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1278e0d324f6908e872730c9102b0112477a7f7cf88b308e4fc36ce1bdb6d58c"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:9a6b5099eeec78827553827f4c6b8615978bb4b6a88e5d9b93eddf8bb6790f55"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:e55541f756f9b3ee346b840103f32779c695a19826a4c442b7954550a0972040"},
+    {file = "pydantic_core-2.23.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:a5c7ba8ffb6d6f8f2ab08743be203654bb1aaa8c9dcb09f82ddd34eadb695605"},
+    {file = "pydantic_core-2.23.4-cp39-none-win32.whl", hash = "sha256:37b0fe330e4a58d3c58b24d91d1eb102aeec675a3db4c292ec3928ecd892a9a6"},
+    {file = "pydantic_core-2.23.4-cp39-none-win_amd64.whl", hash = "sha256:1498bec4c05c9c787bde9125cfdcc63a41004ff167f495063191b863399b1a29"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:f455ee30a9d61d3e1a15abd5068827773d6e4dc513e795f380cdd59932c782d5"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:1e90d2e3bd2c3863d48525d297cd143fe541be8bbf6f579504b9712cb6b643ec"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2e203fdf807ac7e12ab59ca2bfcabb38c7cf0b33c41efeb00f8e5da1d86af480"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e08277a400de01bc72436a0ccd02bdf596631411f592ad985dcee21445bd0068"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:f220b0eea5965dec25480b6333c788fb72ce5f9129e8759ef876a1d805d00801"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:d06b0c8da4f16d1d1e352134427cb194a0a6e19ad5db9161bf32b2113409e728"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:ba1a0996f6c2773bd83e63f18914c1de3c9dd26d55f4ac302a7efe93fb8e7433"},
+    {file = "pydantic_core-2.23.4-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:9a5bce9d23aac8f0cf0836ecfc033896aa8443b501c58d0602dbfd5bd5b37753"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:78ddaaa81421a29574a682b3179d4cf9e6d405a09b99d93ddcf7e5239c742e21"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:883a91b5dd7d26492ff2f04f40fbb652de40fcc0afe07e8129e8ae779c2110eb"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:88ad334a15b32a791ea935af224b9de1bf99bcd62fabf745d5f3442199d86d59"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:233710f069d251feb12a56da21e14cca67994eab08362207785cf8c598e74577"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:19442362866a753485ba5e4be408964644dd6a09123d9416c54cd49171f50744"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:624e278a7d29b6445e4e813af92af37820fafb6dcc55c012c834f9e26f9aaaef"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:f5ef8f42bec47f21d07668a043f077d507e5bf4e668d5c6dfe6aaba89de1a5b8"},
+    {file = "pydantic_core-2.23.4-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:aea443fffa9fbe3af1a9ba721a87f926fe548d32cab71d188a6ede77d0ff244e"},
+    {file = "pydantic_core-2.23.4.tar.gz", hash = "sha256:2584f7cf844ac4d970fba483a717dbe10c1c1c96a969bf65d61ffe94df1b2863"},
]

[package.dependencies]
...

[[package]]
name = "pydantic-settings"
+version = "2.5.2"
description = "Settings management using Pydantic"
optional = false
python-versions = ">=3.8"
files = [
+    {file = "pydantic_settings-2.5.2-py3-none-any.whl", hash = "sha256:2c912e55fd5794a59bf8c832b9de832dcfdf4778d79ff79b708744eed499a907"},
+    {file = "pydantic_settings-2.5.2.tar.gz", hash = "sha256:f90b139682bee4d2065273d5185d71d37ea46cfe57e1b5ae184fc6a0b2484ca0"},
]

[package.dependencies]
...
    {file = "pypdfium2-4.30.0.tar.gz", hash = "sha256:48b5b7e5566665bc1015b9d69c1ebabe21f6aee468b509531c3c8318eeee2e16"},
]

+[[package]]
+name = "pyreadline3"
+version = "3.5.4"
+description = "A python implementation of GNU readline."
+optional = false
+python-versions = ">=3.8"
+files = [
+    {file = "pyreadline3-3.5.4-py3-none-any.whl", hash = "sha256:eaf8e6cc3c49bcccf145fc6067ba8643d1df34d604a1ec0eccbf7a18e6d3fae6"},
+    {file = "pyreadline3-3.5.4.tar.gz", hash = "sha256:8d57d53039a1c75adba8e50dd3d992b28143480816187ea5efbd5c78e6c885b7"},
+]
+
+[package.extras]
+dev = ["build", "flake8", "mypy", "pytest", "twine"]
+
[[package]]
name = "python-dateutil"
version = "2.9.0.post0"
...

[[package]]
name = "scikit-learn"
+version = "1.5.2"
description = "A set of python modules for machine learning and data mining"
optional = false
python-versions = ">=3.9"
files = [
+    {file = "scikit_learn-1.5.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:299406827fb9a4f862626d0fe6c122f5f87f8910b86fe5daa4c32dcd742139b6"},
+    {file = "scikit_learn-1.5.2-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:2d4cad1119c77930b235579ad0dc25e65c917e756fe80cab96aa3b9428bd3fb0"},
+    {file = "scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8c412ccc2ad9bf3755915e3908e677b367ebc8d010acbb3f182814524f2e5540"},
+    {file = "scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3a686885a4b3818d9e62904d91b57fa757fc2bed3e465c8b177be652f4dd37c8"},
+    {file = "scikit_learn-1.5.2-cp310-cp310-win_amd64.whl", hash = "sha256:c15b1ca23d7c5f33cc2cb0a0d6aaacf893792271cddff0edbd6a40e8319bc113"},
+    {file = "scikit_learn-1.5.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:03b6158efa3faaf1feea3faa884c840ebd61b6484167c711548fce208ea09445"},
+    {file = "scikit_learn-1.5.2-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:1ff45e26928d3b4eb767a8f14a9a6efbf1cbff7c05d1fb0f95f211a89fd4f5de"},
+    {file = "scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f763897fe92d0e903aa4847b0aec0e68cadfff77e8a0687cabd946c89d17e675"},
+    {file = "scikit_learn-1.5.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f8b0ccd4a902836493e026c03256e8b206656f91fbcc4fde28c57a5b752561f1"},
+    {file = "scikit_learn-1.5.2-cp311-cp311-win_amd64.whl", hash = "sha256:6c16d84a0d45e4894832b3c4d0bf73050939e21b99b01b6fd59cbb0cf39163b6"},
+    {file = "scikit_learn-1.5.2-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:f932a02c3f4956dfb981391ab24bda1dbd90fe3d628e4b42caef3e041c67707a"},
+    {file = "scikit_learn-1.5.2-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:3b923d119d65b7bd555c73be5423bf06c0105678ce7e1f558cb4b40b0a5502b1"},
+    {file = "scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f60021ec1574e56632be2a36b946f8143bf4e5e6af4a06d85281adc22938e0dd"},
+    {file = "scikit_learn-1.5.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:394397841449853c2290a32050382edaec3da89e35b3e03d6cc966aebc6a8ae6"},
+    {file = "scikit_learn-1.5.2-cp312-cp312-win_amd64.whl", hash = "sha256:57cc1786cfd6bd118220a92ede80270132aa353647684efa385a74244a41e3b1"},
+    {file = "scikit_learn-1.5.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e9a702e2de732bbb20d3bad29ebd77fc05a6b427dc49964300340e4c9328b3f5"},
+    {file = "scikit_learn-1.5.2-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:b0768ad641981f5d3a198430a1d31c3e044ed2e8a6f22166b4d546a5116d7908"},
+    {file = "scikit_learn-1.5.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:178ddd0a5cb0044464fc1bfc4cca5b1833bfc7bb022d70b05db8530da4bb3dd3"},
+    {file = "scikit_learn-1.5.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f7284ade780084d94505632241bf78c44ab3b6f1e8ccab3d2af58e0e950f9c12"},
+    {file = "scikit_learn-1.5.2-cp313-cp313-win_amd64.whl", hash = "sha256:b7b0f9a0b1040830d38c39b91b3a44e1b643f4b36e36567b80b7c6bd2202a27f"},
+    {file = "scikit_learn-1.5.2-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:757c7d514ddb00ae249832fe87100d9c73c6ea91423802872d9e74970a0e40b9"},
+    {file = "scikit_learn-1.5.2-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:52788f48b5d8bca5c0736c175fa6bdaab2ef00a8f536cda698db61bd89c551c1"},
+    {file = "scikit_learn-1.5.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:643964678f4b5fbdc95cbf8aec638acc7aa70f5f79ee2cdad1eec3df4ba6ead8"},
+    {file = "scikit_learn-1.5.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ca64b3089a6d9b9363cd3546f8978229dcbb737aceb2c12144ee3f70f95684b7"},
+    {file = "scikit_learn-1.5.2-cp39-cp39-win_amd64.whl", hash = "sha256:3bed4909ba187aca80580fe2ef370d9180dcf18e621a27c4cf2ef10d279a7efe"},
+    {file = "scikit_learn-1.5.2.tar.gz", hash = "sha256:b4237ed7b3fdd0a4882792e68ef2545d5baa50aca3bb45aa7df468138ad8f94d"},
]

[package.dependencies]
joblib = ">=1.2.0"
numpy = ">=1.19.5"
scipy = ">=1.6.0"
+threadpoolctl = ">=3.1.0"

[package.extras]
+benchmark = ["matplotlib (>=3.3.4)", "memory_profiler (>=0.57.0)", "pandas (>=1.1.5)"]
+build = ["cython (>=3.0.10)", "meson-python (>=0.16.0)", "numpy (>=1.19.5)", "scipy (>=1.6.0)"]
+docs = ["Pillow (>=7.1.2)", "matplotlib (>=3.3.4)", "memory_profiler (>=0.57.0)", "numpydoc (>=1.2.0)", "pandas (>=1.1.5)", "plotly (>=5.14.0)", "polars (>=0.20.30)", "pooch (>=1.6.0)", "pydata-sphinx-theme (>=0.15.3)", "scikit-image (>=0.17.2)", "seaborn (>=0.9.0)", "sphinx (>=7.3.7)", "sphinx-copybutton (>=0.5.2)", "sphinx-design (>=0.5.0)", "sphinx-design (>=0.6.0)", "sphinx-gallery (>=0.16.0)", "sphinx-prompt (>=1.4.0)", "sphinx-remove-toctrees (>=1.0.0.post1)", "sphinxcontrib-sass (>=0.3.4)", "sphinxext-opengraph (>=0.9.1)"]
examples = ["matplotlib (>=3.3.4)", "pandas (>=1.1.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.17.2)", "seaborn (>=0.9.0)"]
+install = ["joblib (>=1.2.0)", "numpy (>=1.19.5)", "scipy (>=1.6.0)", "threadpoolctl (>=3.1.0)"]
+maintenance = ["conda-lock (==2.5.6)"]
+tests = ["black (>=24.3.0)", "matplotlib (>=3.3.4)", "mypy (>=1.9)", "numpydoc (>=1.2.0)", "pandas (>=1.1.5)", "polars (>=0.20.30)", "pooch (>=1.6.0)", "pyamg (>=4.0.0)", "pyarrow (>=12.0.0)", "pytest (>=7.1.2)", "pytest-cov (>=2.9.0)", "ruff (>=0.2.1)", "scikit-image (>=0.17.2)"]

[[package]]
name = "scipy"
+version = "1.14.1"
description = "Fundamental algorithms for scientific computing in Python"
optional = false
+python-versions = ">=3.10"
+files = [
+    {file = "scipy-1.14.1-cp310-cp310-macosx_10_13_x86_64.whl", hash = "sha256:b28d2ca4add7ac16ae8bb6632a3c86e4b9e4d52d3e34267f6e1b0c1f8d87e389"},
+    {file = "scipy-1.14.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:d0d2821003174de06b69e58cef2316a6622b60ee613121199cb2852a873f8cf3"},
+    {file = "scipy-1.14.1-cp310-cp310-macosx_14_0_arm64.whl", hash = "sha256:8bddf15838ba768bb5f5083c1ea012d64c9a444e16192762bd858f1e126196d0"},
+    {file = "scipy-1.14.1-cp310-cp310-macosx_14_0_x86_64.whl", hash = "sha256:97c5dddd5932bd2a1a31c927ba5e1463a53b87ca96b5c9bdf5dfd6096e27efc3"},
+    {file = "scipy-1.14.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2ff0a7e01e422c15739ecd64432743cf7aae2b03f3084288f399affcefe5222d"},
+    {file = "scipy-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8e32dced201274bf96899e6491d9ba3e9a5f6b336708656466ad0522d8528f69"},
+    {file = "scipy-1.14.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:8426251ad1e4ad903a4514712d2fa8fdd5382c978010d1c6f5f37ef286a713ad"},
+    {file = "scipy-1.14.1-cp310-cp310-win_amd64.whl", hash = "sha256:a49f6ed96f83966f576b33a44257d869756df6cf1ef4934f59dd58b25e0327e5"},
+    {file = "scipy-1.14.1-cp311-cp311-macosx_10_13_x86_64.whl", hash = "sha256:2da0469a4ef0ecd3693761acbdc20f2fdeafb69e6819cc081308cc978153c675"},
+    {file = "scipy-1.14.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:c0ee987efa6737242745f347835da2cc5bb9f1b42996a4d97d5c7ff7928cb6f2"},
+    {file = "scipy-1.14.1-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:3a1b111fac6baec1c1d92f27e76511c9e7218f1695d61b59e05e0fe04dc59617"},
+    {file = "scipy-1.14.1-cp311-cp311-macosx_14_0_x86_64.whl", hash = "sha256:8475230e55549ab3f207bff11ebfc91c805dc3463ef62eda3ccf593254524ce8"},
+    {file = "scipy-1.14.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:278266012eb69f4a720827bdd2dc54b2271c97d84255b2faaa8f161a158c3b37"},
+    {file = "scipy-1.14.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fef8c87f8abfb884dac04e97824b61299880c43f4ce675dd2cbeadd3c9b466d2"},
+    {file = "scipy-1.14.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:b05d43735bb2f07d689f56f7b474788a13ed8adc484a85aa65c0fd931cf9ccd2"},
+    {file = "scipy-1.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:716e389b694c4bb564b4fc0c51bc84d381735e0d39d3f26ec1af2556ec6aad94"},
+    {file = "scipy-1.14.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:631f07b3734d34aced009aaf6fedfd0eb3498a97e581c3b1e5f14a04164a456d"},
+    {file = "scipy-1.14.1-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:af29a935803cc707ab2ed7791c44288a682f9c8107bc00f0eccc4f92c08d6e07"},
+    {file = "scipy-1.14.1-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:2843f2d527d9eebec9a43e6b406fb7266f3af25a751aa91d62ff416f54170bc5"},
+    {file = "scipy-1.14.1-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:eb58ca0abd96911932f688528977858681a59d61a7ce908ffd355957f7025cfc"},
+    {file = "scipy-1.14.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:30ac8812c1d2aab7131a79ba62933a2a76f582d5dbbc695192453dae67ad6310"},
+    {file = "scipy-1.14.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8f9ea80f2e65bdaa0b7627fb00cbeb2daf163caa015e59b7516395fe3bd1e066"},
+    {file = "scipy-1.14.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:edaf02b82cd7639db00dbff629995ef185c8df4c3ffa71a5562a595765a06ce1"},
+    {file = "scipy-1.14.1-cp312-cp312-win_amd64.whl", hash = "sha256:2ff38e22128e6c03ff73b6bb0f85f897d2362f8c052e3b8ad00532198fbdae3f"},
+    {file = "scipy-1.14.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:1729560c906963fc8389f6aac023739ff3983e727b1a4d87696b7bf108316a79"},
+    {file = "scipy-1.14.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:4079b90df244709e675cdc8b93bfd8a395d59af40b72e339c2287c91860deb8e"},
+    {file = "scipy-1.14.1-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:e0cf28db0f24a38b2a0ca33a85a54852586e43cf6fd876365c86e0657cfe7d73"},
+    {file = "scipy-1.14.1-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:0c2f95de3b04e26f5f3ad5bb05e74ba7f68b837133a4492414b3afd79dfe540e"},
+    {file = "scipy-1.14.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b99722ea48b7ea25e8e015e8341ae74624f72e5f21fc2abd45f3a93266de4c5d"},
+    {file = "scipy-1.14.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5149e3fd2d686e42144a093b206aef01932a0059c2a33ddfa67f5f035bdfe13e"},
+    {file = "scipy-1.14.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e4f5a7c49323533f9103d4dacf4e4f07078f360743dec7f7596949149efeec06"},
+    {file = "scipy-1.14.1-cp313-cp313-win_amd64.whl", hash = "sha256:baff393942b550823bfce952bb62270ee17504d02a1801d7fd0719534dfb9c84"},
+    {file = "scipy-1.14.1.tar.gz", hash = "sha256:5a275584e726026a5699459aa72f828a610821006228e841b94275c4a7c08417"},
]

[package.dependencies]
+numpy = ">=1.23.5,<2.3"

[package.extras]
+dev = ["cython-lint (>=0.12.2)", "doit (>=0.36.0)", "mypy (==1.10.0)", "pycodestyle", "pydevtool", "rich-click", "ruff (>=0.0.292)", "types-psutil", "typing_extensions"]
+doc = ["jupyterlite-pyodide-kernel", "jupyterlite-sphinx (>=0.13.1)", "jupytext", "matplotlib (>=3.5)", "myst-nb", "numpydoc", "pooch", "pydata-sphinx-theme (>=0.15.2)", "sphinx (>=5.0.0,<=7.3.7)", "sphinx-design (>=0.4.0)"]
+test = ["Cython", "array-api-strict (>=2.0)", "asv", "gmpy2", "hypothesis (>=6.30)", "meson", "mpmath", "ninja", "pooch", "pytest", "pytest-cov", "pytest-timeout", "pytest-xdist", "scikit-umfpack", "threadpoolctl"]

[[package]]
name = "send2trash"
...

[[package]]
name = "surya-ocr"
+version = "0.6.3"
+description = "OCR, layout, reading order, and table recognition in 90+ languages"
optional = false
+python-versions = ">=3.10"
files = [
+    {file = "surya_ocr-0.6.3-py3-none-any.whl", hash = "sha256:f4d98e643ed6003a1fed2a758bed391ffc7be908c849d3ab741b05c4d6a714a2"},
+    {file = "surya_ocr-0.6.3.tar.gz", hash = "sha256:cf0e9382352eaf96ff74fe0ca5daff30f96f0897bb481ff418a8ae1a7ce31534"},
]

[package.dependencies]
filetype = ">=1.2.0,<2.0.0"
ftfy = ">=6.1.3,<7.0.0"
opencv-python = ">=4.9.0.80,<5.0.0.0"
+pdftext = ">=0.3.12,<0.4.0"
pillow = ">=10.2.0,<11.0.0"
pydantic = ">=2.5.3,<3.0.0"
pydantic-settings = ">=2.1.0,<3.0.0"
...

[package.extras]
dev = ["hypothesis (>=6.70.0)", "pytest (>=7.1.0)"]

+[[package]]
+name = "tabled-pdf"
+version = "0.1.0"
+description = "Detect and recognize tables in PDFs and images."
+optional = false
+python-versions = "<4.0,>=3.10"
+files = [
+    {file = "tabled_pdf-0.1.0-py3-none-any.whl", hash = "sha256:95e3e5863cfbe829c9f233e3e9dc31be8c5f24ffd2367f57e983e710aeee659e"},
+    {file = "tabled_pdf-0.1.0.tar.gz", hash = "sha256:63a2c7d3ae55b3e7e467c2fbad9d78c7c57e31810324fc584cbf322e8e026890"},
+]
+
+[package.dependencies]
+click = ">=8.1.7,<9.0.0"
+pydantic = ">=2.9.2,<3.0.0"
+pydantic-settings = ">=2.5.2,<3.0.0"
+pypdfium2 = ">=4.30.0,<5.0.0"
+python-dotenv = ">=1.0.1,<2.0.0"
+scikit-learn = ">=1.5.2,<2.0.0"
+surya-ocr = ">=0.6.3,<0.7.0"
+tabulate = ">=0.9.0,<0.10.0"
+
[[package]]
name = "tabulate"
version = "0.9.0"

...

idna = ">=2.0"
multidict = ">=4.0"

[metadata]
lock-version = "2.0"
+python-versions = "^3.10"
+content-hash = "887985e53de36c13b8f82a96b1a93fea4ca6762db31bdcf9aa8147572c8a4771"
pyproject.toml
CHANGED
@@ -20,25 +20,23 @@
]

[tool.poetry.dependencies]
-python = "
+python = "^3.10"
-scikit-learn = "^1.3.2,<=1.4.2"
Pillow = "^10.1.0"
pydantic = "^2.4.2"
pydantic-settings = "^2.0.3"
-transformers = "^4.
+transformers = "^4.45.2"
-numpy = "^1.26.1"
python-dotenv = "^1.0.0"
-torch = "^2.
+torch = "^2.4.1"
tqdm = "^4.66.1"
tabulate = "^0.9.0"
ftfy = "^6.1.1"
-texify = "^0.
+texify = "^0.2.0"
rapidfuzz = "^3.8.1"
-surya-ocr = "^0.
+surya-ocr = "^0.6.3"
filetype = "^1.2.0"
regex = "^2024.4.28"
-pdftext = "^0.3.
+pdftext = "^0.3.13"
+tabled-pdf = "^0.1.0"
-

[tool.poetry.group.dev.dependencies]
jupyter = "^1.0.0"
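The constraints above mix Poetry's caret (`^`) ranges (e.g. `tabled-pdf = "^0.1.0"`) with explicit bounds in the lock file (e.g. `surya-ocr = ">=0.6.3,<0.7.0"`); the two are equivalent, because a caret requirement allows updates up to (but not including) the next bump of the left-most non-zero version component. A minimal sketch of that expansion (the helper name is my own, not part of Poetry's API):

```python
def caret_upper_bound(version: str) -> str:
    """Exclusive upper bound implied by a Poetry caret constraint ^version.

    The caret operator permits changes that do not modify the left-most
    non-zero component of the version.
    """
    parts = [int(p) for p in version.split(".")]
    for i, part in enumerate(parts):
        if part != 0 or i == len(parts) - 1:
            # Bump this component and zero out everything after it.
            bumped = parts[:i] + [part + 1] + [0] * (len(parts) - i - 1)
            return ".".join(str(n) for n in bumped)


# "^0.6.3" (surya-ocr)   -> >=0.6.3, <0.7.0
print(caret_upper_bound("0.6.3"))   # -> 0.7.0
# "^0.1.0" (tabled-pdf)  -> >=0.1.0, <0.2.0
print(caret_upper_bound("0.1.0"))   # -> 0.2.0
# "^3.10"  (python)      -> >=3.10,  <4.0
print(caret_upper_bound("3.10"))    # -> 4.0
```

This is why the locked range for `surya-ocr` reads `>=0.6.3,<0.7.0` rather than `<1.0.0`: with a leading zero, a minor bump is treated as potentially breaking.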