Vik Paruchuri committed
Commit 5471d0c · Parent(s): 68eee70
Factor out llm services, enable local models
- README.md +13 -3
- marker/builders/llm_layout.py +5 -5
- marker/config/crawler.py +2 -1
- marker/config/parser.py +17 -1
- marker/converters/__init__.py +3 -1
- marker/converters/pdf.py +19 -1
- marker/processors/llm/__init__.py +10 -5
- marker/processors/llm/llm_meta.py +5 -4
- marker/processors/llm/llm_table.py +1 -1
- marker/processors/llm/llm_table_merge.py +1 -1
- marker/scripts/convert.py +2 -1
- marker/scripts/convert_single.py +2 -1
- marker/scripts/server.py +2 -1
- marker/scripts/streamlit_app.py +2 -1
- marker/services/__init__.py +26 -0
- marker/services/{google.py → gemini.py} +25 -18
- marker/services/ollama.py +71 -0
- marker/services/vertex.py +23 -0
- marker/util.py +14 -2
- tests/conftest.py +12 -0
- tests/processors/test_inline_math.py +3 -4
- tests/processors/test_llm_processors.py +24 -26
- tests/processors/test_table_merge.py +2 -3
README.md
CHANGED

@@ -22,7 +22,7 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc
 
 ## Hybrid Mode
 
-For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, format tables properly, and extract values from forms. It
+For the highest accuracy, pass the `--use_llm` flag to use an LLM alongside marker. This will do things like merge tables across pages, handle inline math, format tables properly, and extract values from forms. It can use any Google model (`gemini-2.0-flash` by default), or any ollama model. See [below](#llm-services) for details.
 
 Here is a table benchmark comparing marker, gemini flash alone, and marker with use_llm:
 
@@ -42,7 +42,7 @@ As you can see, the use_llm mode offers higher accuracy than marker or gemini al
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.
 
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under
+The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under \$5M USD in gross revenue in the most recent 12-month period AND under \$5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
 
 # Hosted API
 
@@ -105,6 +105,8 @@ Options:
 - `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "en,fr,de"` for English, French, and German.
 - `config --help`: List all available builders, processors, and converters, and their associated configuration. These values can be used to build a JSON configuration file for additional tweaking of marker defaults.
 - `--converter_cls`: One of `marker.converters.pdf.PdfConverter` (default) or `marker.converters.table.TableConverter`. The `PdfConverter` will convert the whole PDF, the `TableConverter` will only extract and convert tables.
+- `--llm_service`: Which llm service to use if `--use_llm` is passed. This defaults to `marker.services.gemini.GoogleGeminiService`.
+- `--help`: see all of the flags that can be passed into marker. (it supports many more options than are listed above)
 
 The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/recognition/languages.py). If you don't need OCR, marker can work with any language.
 
@@ -146,7 +148,7 @@ text, _, images = text_from_rendered(rendered)
 
 ### Custom configuration
 
-You can pass configuration using the `ConfigParser
+You can pass configuration using the `ConfigParser`. To see all available options, do `marker_single --help`.
 
 ```python
 from marker.converters.pdf import PdfConverter
 
@@ -310,6 +312,14 @@ All output formats will return a metadata dictionary, with the following fields:
 }
 ```
 
+# LLM Services
+
+When running with the `--use_llm` flag, you have a choice of services you can use:
+
+- `Gemini` - this will use the Gemini developer API by default. You'll need to pass `--gemini_api_key` to configuration.
+- `Google Vertex` - this will use vertex, which can be more reliable. You'll need to pass `--vertex_project_id` and `--vertex_location`. To use it, set `--llm_service=marker.services.vertex.GoogleVertexService`.
+- `Ollama` - this will use local models. You can configure `--ollama_base_url` and `--ollama_model`. To use it, set `--llm_service=marker.services.ollama.OllamaService`.
+
 # Internals
 
 Marker is easy to extend. The core units of marker are:
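The `--llm_service` flag above takes a full dotted import path to a service class. A minimal sketch of how such a path can be resolved to a class at runtime (`class_from_path` is an illustrative helper, not marker's actual implementation; a stdlib class stands in for a marker service):

```python
import importlib

def class_from_path(path: str) -> type:
    # Split "pkg.module.ClassName" into a module path and a class name,
    # import the module, then pull the class attribute off it.
    module_path, class_name = path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Demonstrated with a stdlib class rather than a marker service:
cls = class_from_path("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```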
marker/builders/llm_layout.py
CHANGED

@@ -1,12 +1,12 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from typing import Annotated
+from typing import Annotated, Type
 
 from surya.layout import LayoutPredictor
 from tqdm import tqdm
 from pydantic import BaseModel
 
 from marker.builders.layout import LayoutBuilder
-from marker.services
+from marker.services import BaseService
 from marker.providers.pdf import PdfProvider
 from marker.schema import BlockTypes
 from marker.schema.blocks import Block
@@ -97,10 +97,10 @@ Potential labels:
 Respond only with one of `Figure`, `Picture`, `ComplexRegion`, `Table`, or `Form`.
 """
 
-    def __init__(self, layout_model: LayoutPredictor, config=None):
+    def __init__(self, layout_model: LayoutPredictor, llm_service: BaseService, config=None):
         super().__init__(layout_model, config)
 
-        self.
+        self.llm_service = llm_service
 
     def __call__(self, document: Document, provider: PdfProvider):
         super().__call__(document, provider)
@@ -158,7 +158,7 @@ Respond only with one of `Figure`, `Picture`, `ComplexRegion`, `Table`, or `Form
     def process_block_relabeling(self, document: Document, page: PageGroup, block: Block, prompt: str):
         image = self.extract_image(document, block)
 
-        response = self.
+        response = self.llm_service(
             prompt,
             image,
             block,
marker/config/crawler.py
CHANGED

@@ -9,10 +9,11 @@ from marker.converters import BaseConverter
 from marker.processors import BaseProcessor
 from marker.providers import BaseProvider
 from marker.renderers import BaseRenderer
+from marker.services import BaseService
 
 
 class ConfigCrawler:
-    def __init__(self, base_classes=(BaseBuilder, BaseProcessor, BaseConverter, BaseProvider, BaseRenderer)):
+    def __init__(self, base_classes=(BaseBuilder, BaseProcessor, BaseConverter, BaseProvider, BaseRenderer, BaseService)):
         self.base_classes = base_classes
         self.class_config_map = {}
marker/config/parser.py
CHANGED

@@ -39,9 +39,9 @@ class ConfigParser:
         fn = click.option("--languages", type=str, default=None, help="Comma separated list of languages to use for OCR.")(fn)
 
         # we put common options here
-        fn = click.option("--google_api_key", type=str, default=None, help="Google API key for using LLMs.")(fn)
         fn = click.option("--use_llm", is_flag=True, default=False, help="Enable higher quality processing with LLMs.")(fn)
         fn = click.option("--converter_cls", type=str, default=None, help="Converter class to use. Defaults to PDF converter.")(fn)
+        fn = click.option("--llm_service", type=str, default=None, help="LLM service to use - should be full import path, like marker.services.gemini.GoogleGeminiService")(fn)
 
         # enum options
         fn = click.option("--force_layout_block", type=click.Choice(choices=[t.name for t in BlockTypes]), default=None,)(fn)
@@ -74,8 +74,23 @@ class ConfigParser:
             case _:
                 if k in crawler.attr_set:
                     config[k] = v
+
+        # Backward compatibility for google_api_key
+        if settings.GOOGLE_API_KEY:
+            config["gemini_api_key"] = settings.GOOGLE_API_KEY
+
         return config
 
+    def get_llm_service(self):
+        # Only return an LLM service when use_llm is enabled
+        if not self.cli_options["use_llm"]:
+            return None
+
+        service_cls = self.cli_options["llm_service"]
+        if service_cls is None:
+            service_cls = "marker.services.gemini.GoogleGeminiService"
+        return service_cls
+
     def get_renderer(self):
         match self.cli_options["output_format"]:
             case "json":
@@ -122,3 +137,4 @@ class ConfigParser:
     def get_base_filename(self, filepath: str):
         basename = os.path.basename(filepath)
         return os.path.splitext(basename)[0]
+
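The selection logic added in `get_llm_service` can be restated as a small standalone function (a sketch; `pick_llm_service` is illustrative, not part of marker — it mirrors the diff's behavior: no service unless `--use_llm` is set, and a Gemini default when `--llm_service` is omitted):

```python
DEFAULT_LLM_SERVICE = "marker.services.gemini.GoogleGeminiService"

def pick_llm_service(cli_options: dict):
    # No service at all unless --use_llm was passed.
    if not cli_options.get("use_llm"):
        return None
    # Fall back to the default Gemini service when --llm_service is omitted.
    return cli_options.get("llm_service") or DEFAULT_LLM_SERVICE

print(pick_llm_service({"use_llm": False}))                       # None
print(pick_llm_service({"use_llm": True}))                        # marker.services.gemini.GoogleGeminiService
print(pick_llm_service({"use_llm": True, "llm_service": "x.Y"}))  # x.Y
```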
marker/converters/__init__.py
CHANGED

@@ -13,6 +13,7 @@ class BaseConverter:
     def __init__(self, config: Optional[BaseModel | dict] = None):
         assign_config(self, config)
         self.config = config
+        self.llm_service = None
 
     def __call__(self, *args, **kwargs):
         raise NotImplementedError
@@ -52,7 +53,8 @@ class BaseConverter:
 
         meta_processor = LLMSimpleBlockMetaProcessor(
             processor_lst=simple_llm_processors,
-
+            llm_service=self.llm_service,
+            config=self.config,
         )
         other_processors.insert(insert_position, meta_processor)
         return other_processors
marker/converters/pdf.py
CHANGED

@@ -1,4 +1,7 @@
 import os
+
+from marker.services.gemini import GoogleGeminiService
+
 os.environ["TOKENIZERS_PARALLELISM"] = "false" # disables a tokenizers warning
 
 import inspect
@@ -86,7 +89,14 @@ class PdfConverter(BaseConverter):
         DebugProcessor,
     )
 
-    def __init__(
+    def __init__(
+        self,
+        artifact_dict: Dict[str, Any],
+        processor_list: Optional[List[str]] = None,
+        renderer: str | None = None,
+        llm_service: str | None = None,
+        config=None
+    ):
         super().__init__(config)
 
         for block_type, override_block_type in self.override_map.items():
@@ -102,6 +112,14 @@ class PdfConverter(BaseConverter):
         else:
             renderer = MarkdownRenderer
 
+        if llm_service:
+            llm_service_cls = strings_to_classes([llm_service])[0]
+            llm_service = self.resolve_dependencies(llm_service_cls)
+
+        # Inject llm service into artifact_dict so it can be picked up by processors, etc.
+        artifact_dict["llm_service"] = llm_service
+        self.llm_service = llm_service
+
         self.artifact_dict = artifact_dict
         self.renderer = renderer
marker/processors/llm/__init__.py
CHANGED

@@ -8,11 +8,12 @@ from PIL import Image
 
 from marker.processors import BaseProcessor
 from marker.schema import BlockTypes
-from marker.services.google import GoogleModel
 from marker.schema.blocks import Block
 from marker.schema.document import Document
 from marker.schema.groups import PageGroup
+from marker.services import BaseService
 from marker.settings import settings
+from marker.util import assign_config
 
 
 class PromptData(TypedDict):
@@ -67,14 +68,14 @@ class BaseLLMProcessor(BaseProcessor):
     ] = False
     block_types = None
 
-    def __init__(self, config=None):
+    def __init__(self, llm_service: BaseService, config=None):
         super().__init__(config)
 
-        self.
+        self.llm_service = None
         if not self.use_llm:
             return
 
-        self.
+        self.llm_service = llm_service
 
     def extract_image(self, document: Document, image_block: Block, remove_blocks: Sequence[BlockTypes] | None = None) -> Image.Image:
         return image_block.get_image(
@@ -90,7 +91,7 @@ class BaseLLMComplexBlockProcessor(BaseLLMProcessor):
     A processor for using LLMs to convert blocks with more complex logic.
     """
     def __call__(self, document: Document):
-        if not self.use_llm or self.
+        if not self.use_llm or self.llm_service is None:
             return
 
         try:
@@ -125,6 +126,10 @@ class BaseLLMSimpleBlockProcessor(BaseLLMProcessor):
     A processor for using LLMs to convert single blocks.
     """
 
+    # Override init since we don't need an llmservice here
+    def __init__(self, config=None):
+        assign_config(self, config)
+
     def __call__(self, result: dict, prompt_data: PromptData, document: Document):
         try:
             self.rewrite_block(result, prompt_data, document)
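The injection pattern in this diff — store the service only when `use_llm` is on, and guard every call with `self.llm_service is None` — can be sketched with toy classes (`ToyLLMProcessor` and `FakeService` are illustrative names, not marker's real API):

```python
class FakeService:
    # Stand-in for a marker BaseService: a callable that returns a parsed dict.
    def __call__(self, prompt):
        return {"ok": True, "prompt": prompt}

class ToyLLMProcessor:
    def __init__(self, llm_service=None, use_llm=False):
        self.use_llm = use_llm
        # Mirror the diff: only keep the injected service when use_llm is enabled.
        self.llm_service = llm_service if use_llm else None

    def __call__(self, prompt):
        # Same guard as BaseLLMComplexBlockProcessor.__call__:
        # silently no-op when the LLM path is disabled or unwired.
        if not self.use_llm or self.llm_service is None:
            return None
        return self.llm_service(prompt)

print(ToyLLMProcessor()("fix table"))                             # None
print(ToyLLMProcessor(FakeService(), use_llm=True)("fix table"))  # {'ok': True, 'prompt': 'fix table'}
```

The benefit of injecting the service rather than constructing it inside each processor is that one configured instance is shared everywhere, and swapping Gemini for Ollama touches no processor code.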
marker/processors/llm/llm_meta.py
CHANGED

@@ -5,18 +5,19 @@ from tqdm import tqdm
 
 from marker.processors.llm import BaseLLMSimpleBlockProcessor, BaseLLMProcessor
 from marker.schema.document import Document
+from marker.services import BaseService
 
 
 class LLMSimpleBlockMetaProcessor(BaseLLMProcessor):
     """
     A wrapper for simple LLM processors, so they can all run in parallel.
     """
-    def __init__(self, processor_lst: List[BaseLLMSimpleBlockProcessor], config=None):
-        super().__init__(config)
+    def __init__(self, processor_lst: List[BaseLLMSimpleBlockProcessor], llm_service: BaseService, config=None):
+        super().__init__(llm_service, config)
         self.processors = processor_lst
 
     def __call__(self, document: Document):
-        if not self.use_llm or self.
+        if not self.use_llm or self.llm_service is None:
             return
 
         total = sum([len(processor.inference_blocks(document)) for processor in self.processors])
@@ -50,4 +51,4 @@ class LLMSimpleBlockMetaProcessor(BaseLLMProcessor):
         pbar.close()
 
     def get_response(self, prompt_data: Dict[str, Any]):
-        return self.
+        return self.llm_service(prompt_data["prompt"], prompt_data["image"], prompt_data["block"], prompt_data["schema"])
marker/processors/llm/llm_table.py
CHANGED

@@ -134,7 +134,7 @@ No corrections needed.
     def rewrite_single_chunk(self, page: PageGroup, block: Block, block_html: str, children: List[TableCell], image: Image.Image):
         prompt = self.table_rewriting_prompt.replace("{block_html}", block_html)
 
-        response = self.
+        response = self.llm_service(prompt, image, block, TableSchema)
 
         if not response or "corrected_html" not in response:
             block.update_metadata(llm_error_count=1)
marker/processors/llm/llm_table_merge.py
CHANGED

@@ -240,7 +240,7 @@ Table 2
 
         prompt = self.table_merge_prompt.replace("{{table1}}", start_html).replace("{{table2}}", curr_html)
 
-        response = self.
+        response = self.llm_service(
             prompt,
             [start_image, curr_image],
             curr_block,
marker/scripts/convert.py
CHANGED

@@ -51,7 +51,8 @@ def process_single_pdf(args):
         config=config_parser.generate_config_dict(),
         artifact_dict=model_refs,
         processor_list=config_parser.get_processors(),
-        renderer=config_parser.get_renderer()
+        renderer=config_parser.get_renderer(),
+        llm_service=config_parser.get_llm_service()
     )
     rendered = converter(fpath)
     out_folder = config_parser.get_output_folder(fpath)
marker/scripts/convert_single.py
CHANGED

@@ -29,7 +29,8 @@ def convert_single_cli(fpath: str, **kwargs):
         config=config_parser.generate_config_dict(),
         artifact_dict=models,
         processor_list=config_parser.get_processors(),
-        renderer=config_parser.get_renderer()
+        renderer=config_parser.get_renderer(),
+        llm_service=config_parser.get_llm_service()
     )
     rendered = converter(fpath)
     out_folder = config_parser.get_output_folder(fpath)
marker/scripts/server.py
CHANGED

@@ -95,7 +95,8 @@ async def _convert_pdf(params: CommonParams):
         config=config_dict,
         artifact_dict=app_data["models"],
         processor_list=config_parser.get_processors(),
-        renderer=config_parser.get_renderer()
+        renderer=config_parser.get_renderer(),
+        llm_service=config_parser.get_llm_service()
     )
     rendered = converter(params.filepath)
     text, _, images = text_from_rendered(rendered)
marker/scripts/streamlit_app.py
CHANGED

@@ -56,7 +56,8 @@ def convert_pdf(fname: str, config_parser: ConfigParser) -> (str, Dict[str, Any]
         config=config_dict,
         artifact_dict=model_dict,
         processor_list=config_parser.get_processors(),
-        renderer=config_parser.get_renderer()
+        renderer=config_parser.get_renderer(),
+        llm_service=config_parser.get_llm_service()
     )
     return converter(fname)
marker/services/__init__.py
CHANGED

@@ -0,0 +1,26 @@
+from typing import Optional, List
+
+import PIL
+from pydantic import BaseModel
+
+from marker.schema.blocks import Block
+from marker.util import assign_config, verify_config_keys
+
+
+class BaseService:
+    def __init__(self, config: Optional[BaseModel | dict] = None):
+        assign_config(self, config)
+
+        # Ensure we have all necessary fields filled out (API keys, etc.)
+        verify_config_keys(self)
+
+    def __call__(
+        self,
+        prompt: str,
+        image: PIL.Image.Image | List[PIL.Image.Image],
+        block: Block,
+        response_schema: type[BaseModel],
+        max_retries: int = 1,
+        timeout: int = 15
+    ):
+        raise NotImplementedError
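Every backend in this commit implements the same callable contract defined by `BaseService` above. A toy sketch of what subclassing looks like (assuming, as a simplification, that `assign_config` just copies config keys onto the instance; `EchoService` is a made-up example, not a marker class):

```python
from typing import Optional

class BaseService:
    # Mirrors the interface in the diff: subclasses implement __call__
    # and return a dict matching response_schema (or {} on failure).
    def __init__(self, config: Optional[dict] = None):
        # assign_config-style behavior: copy config keys onto the instance.
        for key, value in (config or {}).items():
            setattr(self, key, value)

    def __call__(self, prompt, image, block, response_schema,
                 max_retries: int = 1, timeout: int = 15):
        raise NotImplementedError

class EchoService(BaseService):
    # A do-nothing backend useful for tests: echoes the prompt back.
    def __call__(self, prompt, image, block, response_schema,
                 max_retries: int = 1, timeout: int = 15):
        return {"echo": prompt}

svc = EchoService({"some_key": "value"})
print(svc.some_key)                                                  # value
print(svc(prompt="hi", image=None, block=None, response_schema=None))  # {'echo': 'hi'}
```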
marker/services/{google.py → gemini.py}
RENAMED

@@ -1,7 +1,7 @@
 import json
 import time
 from io import BytesIO
-from typing import List
+from typing import List, Annotated
 
 import PIL
 from google import genai
@@ -10,29 +10,23 @@ from google.genai.errors import APIError
 from pydantic import BaseModel
 
 from marker.schema.blocks import Block
-from marker.
+from marker.services import BaseService
 
-        self.api_key = api_key
-        self.model_name = model_name
-
-    def get_google_client(self, timeout: int = 60):
-        return genai.Client(
-            api_key=settings.GOOGLE_API_KEY,
-            http_options={"timeout": timeout * 1000} # Convert to milliseconds
-        )
+class BaseGeminiService(BaseService):
+    gemini_model_name: Annotated[
+        str,
+        "The name of the Google model to use for the service."
+    ] = "gemini-2.0-flash"
 
     def img_to_bytes(self, img: PIL.Image.Image):
         image_bytes = BytesIO()
         img.save(image_bytes, format="PNG")
         return image_bytes.getvalue()
 
-    def
+    def get_google_client(self, timeout: int = 60):
+        raise NotImplementedError
+
+    def __call__(
         self,
         prompt: str,
         image: PIL.Image.Image | List[PIL.Image.Image],
@@ -51,7 +45,7 @@ class GoogleModel:
         while tries < max_retries:
             try:
                 responses = client.models.generate_content(
-                    model=
+                    model=self.gemini_model_name,
                     contents=image_parts + [prompt], # According to gemini docs, it performs better if the image is the first element
                     config={
                         "temperature": 0,
@@ -78,3 +72,16 @@ class GoogleModel:
                 break
 
         return {}
+
+
+class GoogleGeminiService(BaseGeminiService):
+    gemini_api_key: Annotated[
+        str,
+        "The Google API key to use for the service."
+    ] = None
+
+    def get_google_client(self, timeout: int = 60):
+        return genai.Client(
+            api_key=self.gemini_api_key,
+            http_options={"timeout": timeout * 1000} # Convert to milliseconds
+        )
marker/services/ollama.py
ADDED

@@ -0,0 +1,71 @@
+import base64
+import json
+from io import BytesIO
+from typing import Annotated, List
+
+import PIL
+import requests
+from pydantic import BaseModel
+
+from marker.schema.blocks import Block
+from marker.services import BaseService
+
+
+class OllamaService(BaseService):
+    ollama_base_url: Annotated[
+        str,
+        "The base url to use for ollama.  No trailing slash."
+    ] = "http://localhost:11434"
+    ollama_model: Annotated[
+        str,
+        "The model name to use for ollama."
+    ] = "llama3.2-vision"
+
+    def image_to_base64(self, image: PIL.Image.Image):
+        image_bytes = BytesIO()
+        image.save(image_bytes, format="PNG")
+        return base64.b64encode(image_bytes.getvalue()).decode("utf-8")
+
+    def __call__(
+        self,
+        prompt: str,
+        image: PIL.Image.Image | List[PIL.Image.Image],
+        block: Block,
+        response_schema: type[BaseModel],
+        max_retries: int = 1,
+        timeout: int = 15
+    ):
+        url = f"{self.ollama_base_url}/api/generate"
+        headers = {"Content-Type": "application/json"}
+
+        schema = response_schema.model_json_schema()
+        format_schema = {
+            "type": "object",
+            "properties": schema["properties"],
+            "required": schema["required"]
+        }
+
+        if not isinstance(image, list):
+            image = [image]
+
+        image_bytes = [self.image_to_base64(img) for img in image]
+
+        payload = {
+            "model": self.ollama_model,
+            "prompt": prompt,
+            "stream": False,
+            "format": format_schema,
+            "images": image_bytes
+        }
+
+        try:
+            response = requests.post(url, json=payload, headers=headers)
+            response.raise_for_status()
+            response_data = response.json()
+            data = response_data["response"]
+            print(data)
+            return json.loads(data)
+        except Exception as e:
+            print(f"Ollama inference failed: {e}")
+
+        return {}
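`OllamaService` trims the pydantic `model_json_schema()` output down to the object shape it passes in the request's `format` field for structured output. The trimming and payload assembly can be sketched in isolation; `build_format_schema` and the sample schema below are illustrative stand-ins for pydantic's generated schema, not marker code:

```python
def build_format_schema(full_schema: dict) -> dict:
    # Keep only the keys the structured-output `format` field needs:
    # the object type, its properties, and the required property names.
    return {
        "type": "object",
        "properties": full_schema["properties"],
        "required": full_schema["required"],
    }


# Hypothetical schema, similar in shape to model_json_schema() output
# for a one-field response model.
schema = {
    "title": "ImageDescription",
    "type": "object",
    "properties": {"image_description": {"type": "string"}},
    "required": ["image_description"],
}

payload = {
    "model": "llama3.2-vision",
    "prompt": "Describe the image.",
    "stream": False,
    "format": build_format_schema(schema),
    "images": [],  # base64-encoded PNGs would go here
}
```

With `stream` off, the server's reply arrives as a single JSON object whose `response` field is a JSON string matching the schema, which is why the service finishes with `json.loads(data)`.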
marker/services/vertex.py
ADDED

@@ -0,0 +1,23 @@
+from typing import Annotated
+
+from google import genai
+
+from marker.services.gemini import BaseGeminiService
+
+class GoogleVertexService(BaseGeminiService):
+    vertex_project_id: Annotated[
+        str,
+        "Google Cloud Project ID for Vertex AI.",
+    ] = None
+    vertex_location: Annotated[
+        str,
+        "Google Cloud Location for Vertex AI.",
+    ] = None
+
+    def get_google_client(self, timeout: int = 60):
+        return genai.Client(
+            vertexai=True,
+            project=self.vertex_project_id,
+            location=self.vertex_location,
+            http_options={"timeout": timeout * 1000}  # Convert to milliseconds
+        )
marker/util.py
CHANGED

@@ -1,7 +1,6 @@
 import inspect
-import re
 from importlib import import_module
-from typing import List
+from typing import List, Annotated
 
 import numpy as np
 from pydantic import BaseModel

@@ -24,6 +23,19 @@ def classes_to_strings(items: List[type]) -> List[str]:
     return [f"{item.__module__}.{item.__name__}" for item in items]
 
 
+def verify_config_keys(obj):
+    annotations = inspect.get_annotations(obj.__class__)
+
+    none_vals = ""
+    for attr_name, annotation in annotations.items():
+        if isinstance(annotation, type(Annotated[str, ""])):
+            value = getattr(obj, attr_name)
+            if value is None:
+                none_vals += f"{attr_name}, "
+
+    assert len(none_vals) == 0, f"Missing values for {none_vals} are not allowed in {obj.__class__.__name__}."
+
+
 def assign_config(cls, config: BaseModel | dict | None):
     cls_name = cls.__class__.__name__
     if config is None:
tests/conftest.py
CHANGED

@@ -18,6 +18,7 @@ from marker.schema.blocks import Block
 from marker.renderers.markdown import MarkdownRenderer
 from marker.renderers.json import JSONRenderer
 from marker.schema.registry import register_block_class
+from marker.services.gemini import GoogleGeminiService
 from marker.util import classes_to_strings
 
 @pytest.fixture(scope="session")

@@ -126,6 +127,17 @@ def renderer(request, config):
     else:
         return MarkdownRenderer
 
+
+@pytest.fixture(scope="function")
+def llm_service(request):
+    llm_service = GoogleGeminiService(
+        config={
+            "gemini_api_key": "test"
+        }
+    )
+    yield llm_service
+
+
 @pytest.fixture(scope="function")
 def temp_image():
     img = Image.new("RGB", (512, 512), color="white")
tests/processors/test_inline_math.py
CHANGED

@@ -17,12 +17,11 @@ def test_llm_text_processor(pdf_document, mocker):
     corrected_lines = ["<math>Text</math>"] * len(text_lines)
 
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"corrected_lines": corrected_lines}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"corrected_lines": corrected_lines}
 
-    config = {"use_llm": True, "google_api_key": "test"}
+    config = {"use_llm": True, "gemini_api_key": "test"}
     processor_lst = [LLMTextProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     contained_spans = text_lines[0].contained_blocks(pdf_document, (BlockTypes.Span,))
tests/processors/test_llm_processors.py
CHANGED

@@ -14,11 +14,12 @@ from marker.renderers.markdown import MarkdownRenderer
 from marker.schema import BlockTypes
 from marker.schema.blocks import ComplexRegion
 
+
 @pytest.mark.filename("form_1040.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_form_processor_no_config(pdf_document):
+def test_llm_form_processor_no_config(pdf_document, llm_service):
     processor_lst = [LLMFormProcessor()]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, llm_service)
     processor(pdf_document)
 
     forms = pdf_document.contained_blocks((BlockTypes.Form,))

@@ -27,9 +28,10 @@ def test_llm_form_processor_no_config(pdf_document):
 
 @pytest.mark.filename("form_1040.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_form_processor_no_cells(pdf_document):
+def test_llm_form_processor_no_cells(pdf_document, llm_service):
+    config = {"use_llm": True, "google_api_key": "test"}
+    processor_lst = [LLMFormProcessor(config)]
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, llm_service, config)
     processor(pdf_document)
 
     forms = pdf_document.contained_blocks((BlockTypes.Form,))

@@ -38,20 +40,19 @@ def test_llm_form_processor_no_cells(pdf_document):
 
 @pytest.mark.filename("form_1040.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_form_processor(pdf_document, detection_model, table_rec_model, recognition_model, mocker):
+def test_llm_form_processor(pdf_document, detection_model, table_rec_model, recognition_model):
     corrected_html = "<em>This is corrected markdown.</em>\n" * 100
     corrected_html = "<p>" + corrected_html.strip() + "</p>\n"
 
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"corrected_html": corrected_html}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"corrected_html": corrected_html}
 
     cell_processor = TableProcessor(detection_model, recognition_model, table_rec_model)
     cell_processor(pdf_document)
 
     config = {"use_llm": True, "google_api_key": "test"}
     processor_lst = [LLMFormProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     forms = pdf_document.contained_blocks((BlockTypes.Form,))

@@ -61,7 +62,7 @@ def test_llm_form_processor(pdf_document, detection_model, table_rec_model, reco
 
 @pytest.mark.filename("table_ex2.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_table_processor(pdf_document, detection_model, table_rec_model, recognition_model, mocker):
+def test_llm_table_processor(pdf_document, detection_model, table_rec_model, recognition_model):
     corrected_html = """
 <table>
     <tr>

@@ -86,13 +87,12 @@ def test_llm_table_processor(pdf_document, detection_model, table_rec_model, rec
     """.strip()
 
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"corrected_html": corrected_html}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"corrected_html": corrected_html}
 
     cell_processor = TableProcessor(detection_model, recognition_model, table_rec_model)
     cell_processor(pdf_document)
 
-    processor = LLMTableProcessor({"use_llm": True, "google_api_key": "test"})
+    processor = LLMTableProcessor(mock_cls, {"use_llm": True, "google_api_key": "test"})
     processor(pdf_document)
 
     tables = pdf_document.contained_blocks((BlockTypes.Table,))

@@ -107,8 +107,9 @@
 @pytest.mark.config({"page_range": [0]})
 def test_llm_caption_processor_disabled(pdf_document):
     config = {"use_llm": True, "google_api_key": "test"}
+    mock_cls = MagicMock()
     processor_lst = [LLMImageDescriptionProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     contained_pictures = pdf_document.contained_blocks((BlockTypes.Picture, BlockTypes.Figure))

@@ -116,15 +117,14 @@ def test_llm_caption_processor_disabled(pdf_document):
 
 @pytest.mark.filename("A17_FlightPlan.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_caption_processor(pdf_document, mocker):
+def test_llm_caption_processor(pdf_document):
     description = "This is an image description."
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"image_description": description}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"image_description": description}
 
     config = {"use_llm": True, "google_api_key": "test", "extract_images": False}
     processor_lst = [LLMImageDescriptionProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     contained_pictures = pdf_document.contained_blocks((BlockTypes.Picture, BlockTypes.Figure))

@@ -139,11 +139,10 @@ def test_llm_caption_processor(pdf_document, mocker):
 
 @pytest.mark.filename("A17_FlightPlan.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_llm_complex_region_processor(pdf_document, mocker):
+def test_llm_complex_region_processor(pdf_document):
     md = "This is some *markdown* for a complex region."
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"corrected_markdown": md * 25}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"corrected_markdown": md * 25}
 
     # Replace the block with a complex region
     old_block = pdf_document.pages[0].children[0]

@@ -155,7 +154,7 @@ def test_llm_complex_region_processor(pdf_document, mocker):
     # Test processor
     config = {"use_llm": True, "google_api_key": "test"}
     processor_lst = [LLMComplexRegionProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     # Ensure the rendering includes the description

@@ -166,15 +165,14 @@ def test_llm_complex_region_processor(pdf_document, mocker):
 
 @pytest.mark.filename("adversarial.pdf")
 @pytest.mark.config({"page_range": [0]})
-def test_multi_llm_processors(pdf_document, mocker):
+def test_multi_llm_processors(pdf_document):
     description = "<math>This is an image description. And here is a lot of writing about it.</math>" * 10
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {"image_description": description, "html_equation": description}
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
+    mock_cls.return_value = {"image_description": description, "html_equation": description}
 
     config = {"use_llm": True, "google_api_key": "test", "extract_images": False, "min_equation_height": .001}
     processor_lst = [LLMImageDescriptionProcessor(config), LLMEquationProcessor(config)]
-    processor = LLMSimpleBlockMetaProcessor(processor_lst, config)
+    processor = LLMSimpleBlockMetaProcessor(processor_lst, mock_cls, config)
     processor(pdf_document)
 
     contained_pictures = pdf_document.contained_blocks((BlockTypes.Picture, BlockTypes.Figure))
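The recurring change in these tests: instead of `mocker.patch`-ing the `GoogleModel` class, the service is now a constructor argument, so a plain callable `Mock` whose `return_value` is the parsed response dict can be injected directly. A minimal standalone sketch of that injection pattern (`run_processor` is a hypothetical stand-in for the meta processor, not marker code):

```python
from unittest.mock import Mock


def run_processor(llm_service, prompt):
    # A processor just calls its injected service like a function
    # and consumes the returned dict.
    return llm_service(prompt)


mock_service = Mock()
mock_service.return_value = {"corrected_html": "<p>fixed</p>"}

result = run_processor(mock_service, "Fix this table")
mock_service.assert_called_once_with("Fix this table")
```

Because the dependency is passed in rather than patched at import location, the same test body works unchanged whether the real service is Gemini, Vertex, or Ollama.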
tests/processors/test_table_merge.py
CHANGED

@@ -10,11 +10,10 @@ from marker.schema import BlockTypes
 @pytest.mark.filename("table_ex2.pdf")
 def test_llm_table_processor_nomerge(pdf_document, detection_model, table_rec_model, recognition_model, mocker):
     mock_cls = Mock()
-    mock_cls.return_value.generate_response.return_value = {
+    mock_cls.return_value = {
         "merge": "true",
         "direction": "right"
     }
-    mocker.patch("marker.processors.llm.GoogleModel", mock_cls)
 
     cell_processor = TableProcessor(detection_model, recognition_model, table_rec_model)
     cell_processor(pdf_document)

@@ -22,7 +21,7 @@ def test_llm_table_processor_nomerge(pdf_document, detection_model, table_rec_mo
     tables = pdf_document.contained_blocks((BlockTypes.Table,))
     assert len(tables) == 3
 
-    processor = LLMTableMergeProcessor({"use_llm": True, "google_api_key": "test"})
+    processor = LLMTableMergeProcessor(mock_cls, {"use_llm": True, "google_api_key": "test"})
     processor(pdf_document)
 
     tables = pdf_document.contained_blocks((BlockTypes.Table,))