AnseMin committed on
Commit
d437733
·
1 Parent(s): 615c16c

Enhance multi-document processing capabilities in parsers

- Implemented validation for batch file processing in both Docling and Mistral OCR parsers, ensuring size and type constraints are met.
- Added support for multi-document processing in Docling, allowing up to 5 files with a combined size limit of 20MB.
- Enhanced the `_create_batch_prompt` and `_format_batch_output` methods in both parsers to handle multiple documents effectively.
- Updated README to reflect new multi-document processing features and parser capabilities.
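
The batch limits described in this commit (at most 5 files, 10MB per file, 20MB combined) can be sketched as a standalone check. This is a minimal illustration, not the committed code: `validate_batch`, `BatchValidationError`, and the constant names are hypothetical stand-ins for the parsers' `_validate_batch_files` and the project's `DocumentProcessingError`.

```python
from pathlib import Path
from typing import List

MAX_FILES = 5
MAX_FILE_BYTES = 10 * 1024 * 1024   # 10 MB per file
MAX_TOTAL_BYTES = 20 * 1024 * 1024  # 20 MB combined


class BatchValidationError(ValueError):
    """Hypothetical stand-in for the project's DocumentProcessingError."""


def validate_batch(paths: List[Path]) -> int:
    """Check count, existence, and size limits; return the combined size in bytes."""
    if not paths:
        raise BatchValidationError("No files provided for processing")
    if len(paths) > MAX_FILES:
        raise BatchValidationError(f"Maximum {MAX_FILES} files allowed for batch processing")

    total = 0
    for path in paths:
        if not path.exists():
            raise BatchValidationError(f"File not found: {path}")
        size = path.stat().st_size
        if size > MAX_FILE_BYTES:
            raise BatchValidationError(f"Individual file size exceeds 10MB: {path.name}")
        total += size

    # The combined limit is checked once, after every file has been measured.
    if total > MAX_TOTAL_BYTES:
        raise BatchValidationError("Combined file size exceeds 20MB limit")
    return total
```

Keeping the limits in named constants means the README figures and the error messages can be kept in sync from one place.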

README.md CHANGED
@@ -23,9 +23,10 @@ A Hugging Face Space that converts various document formats to Markdown and lets
23
  - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
24
  - Multiple parser options:
25
  - MarkItDown: For comprehensive document conversion
26
- - Docling: For advanced PDF understanding with table structure recognition
27
  - GOT-OCR: For image-based OCR with LaTeX support
28
  - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
29
  - **🆕 Intelligent Processing Types**:
30
  - **Combined**: Merge documents into unified content with duplicate removal
31
  - **Individual**: Separate sections per document with clear organization
@@ -61,14 +62,16 @@ A Hugging Face Space that converts various document formats to Markdown and lets
61
 
62
  **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
63
 
64
- **Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis.
65
 
66
  **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
67
 
68
  ## 🚀 Multi-Document Processing
69
 
70
  ### **What makes this special?**
71
- Markit v2 introduces **industry-leading multi-document processing** powered by Google's Gemini Flash 2.5, enabling intelligent analysis across multiple documents simultaneously.
72
 
73
  ### **Key Capabilities:**
74
  - **📊 Cross-Document Analysis**: Compare and contrast information across different files
@@ -181,7 +184,10 @@ The application uses centralized configuration management. You can enhance funct
181
  - **Individual**: Keep documents separate with clear section headers
182
  - **Summary**: Executive overview + detailed analysis of each document
183
  - **Comparison**: Side-by-side analysis with similarities/differences tables
184
- 5. Choose your preferred parser (recommend **Gemini Flash** for best multi-document results)
185
  6. Click "Convert"
186
  7. Get intelligent cross-document analysis and download enhanced output
187
 
 
23
  - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
24
  - Multiple parser options:
25
  - MarkItDown: For comprehensive document conversion
26
+ - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
27
  - GOT-OCR: For image-based OCR with LaTeX support
28
  - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
29
+ - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
30
  - **🆕 Intelligent Processing Types**:
31
  - **Combined**: Merge documents into unified content with duplicate removal
32
  - **Individual**: Separate sections per document with clear organization
 
62
 
63
  **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
64
 
65
+ **Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis. **Supports multi-document processing** with Gemini-powered summary & comparison.
66
 
67
  **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
68
 
69
+ **Mistral OCR**: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode. **Supports multi-document processing** with Gemini-powered summary & comparison.
70
+
71
  ## 🚀 Multi-Document Processing
72
 
73
  ### **What makes this special?**
74
+ Markit v2 introduces **industry-leading multi-document processing** with **three powerful parser options**: Gemini Flash (native multi-document AI), Mistral OCR (high-accuracy with Document Understanding), and Docling (advanced PDF analysis). All support intelligent cross-document analysis.
75
 
76
  ### **Key Capabilities:**
77
  - **📊 Cross-Document Analysis**: Compare and contrast information across different files
 
184
  - **Individual**: Keep documents separate with clear section headers
185
  - **Summary**: Executive overview + detailed analysis of each document
186
  - **Comparison**: Side-by-side analysis with similarities/differences tables
187
+ 5. Choose your preferred parser:
188
+ - **Gemini Flash**: Best for advanced cross-document reasoning and native multi-document support
189
+ - **Mistral OCR**: Great for high-accuracy OCR with Document Understanding mode
190
+ - **Docling**: Excellent for PDF table structure + multi-document analysis
191
  6. Click "Convert"
192
  7. Get intelligent cross-document analysis and download enhanced output
193
 
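The **Combined** and **Individual** processing types in the README can be composed locally, without a model call, once each document has been converted to Markdown. The sketch below assumes that conversion has already happened; `compose_local` and `SEPARATOR` are illustrative names and not part of the repository.

```python
from typing import List

SEPARATOR = "\n\n---\n\n"  # horizontal rule between documents


def compose_local(markdown_list: List[str], names: List[str], processing_type: str = "combined") -> str:
    """Join per-document Markdown, optionally under numbered per-document headings."""
    if processing_type == "individual":
        # enumerate(..., start=1) already yields 1-based indices, so use i directly.
        sections = [
            f"# Document {i}: {name}\n\n{md}"
            for i, (name, md) in enumerate(zip(names, markdown_list), start=1)
        ]
        return SEPARATOR.join(sections)
    # "combined": concatenate the documents as they are.
    return SEPARATOR.join(markdown_list)
```

**Summary** and **Comparison** need a language model pass over the joined Markdown, so they are not covered by this local path.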
src/parsers/docling_parser.py CHANGED
@@ -8,6 +8,7 @@ import tempfile
8
  from src.parsers.parser_interface import DocumentParser
9
  from src.parsers.parser_registry import ParserRegistry
10
  from src.core.exceptions import DocumentProcessingError, ParserError
11
 
12
  # Check for Docling availability
13
  try:
@@ -20,6 +21,13 @@ except ImportError:
20
  HAS_DOCLING = False
21
  logging.warning("Docling package not installed. Please install with 'pip install docling'")
22
 
23
  # Configure logging
24
  logger = logging.getLogger(__name__)
25
  logger.setLevel(logging.DEBUG)
@@ -199,6 +207,137 @@ class DoclingParser(DocumentParser):
199
  def get_description(cls) -> str:
200
  return "Docling parser with advanced PDF understanding, table structure recognition, and multiple OCR engines"
201
 
202
 
203
  # Register the parser with the registry if available
204
  if HAS_DOCLING:
 
8
  from src.parsers.parser_interface import DocumentParser
9
  from src.parsers.parser_registry import ParserRegistry
10
  from src.core.exceptions import DocumentProcessingError, ParserError
11
+ from src.core.config import config
12
 
13
  # Check for Docling availability
14
  try:
 
21
  HAS_DOCLING = False
22
  logging.warning("Docling package not installed. Please install with 'pip install docling'")
23
 
24
+ # Gemini availability
25
+ try:
26
+ from google import genai
27
+ HAS_GEMINI = True
28
+ except ImportError:
29
+ HAS_GEMINI = False
30
+
31
  # Configure logging
32
  logger = logging.getLogger(__name__)
33
  logger.setLevel(logging.DEBUG)
 
207
  def get_description(cls) -> str:
208
  return "Docling parser with advanced PDF understanding, table structure recognition, and multiple OCR engines"
209
 
210
+ def _validate_batch_files(self, file_paths: List[Path]) -> None:
211
+ """Validate a batch of files (count, existence, size) for multi-document processing."""
212
+ if len(file_paths) == 0:
213
+ raise DocumentProcessingError("No files provided for processing")
214
+ if len(file_paths) > 5:
215
+ raise DocumentProcessingError("Maximum 5 files allowed for batch processing")
216
+
217
+ total_size = 0
218
+ for fp in file_paths:
219
+ if not fp.exists():
220
+ raise DocumentProcessingError(f"File not found: {fp}")
221
+ size = fp.stat().st_size
222
+ if size > 10 * 1024 * 1024: # 10 MB
223
+ raise DocumentProcessingError(f"Individual file size exceeds 10MB: {fp.name}")
224
+ total_size += size
225
+ if total_size > 20 * 1024 * 1024:
226
+ raise DocumentProcessingError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
227
+
228
+ def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
229
+ """Create a natural-language prompt for Gemini post-processing."""
230
+ names = original_filenames if original_filenames else [p.name for p in file_paths]
231
+ file_list = "\n".join(f"- {n}" for n in names)
232
+ base = f"I will provide you with {len(file_paths)} documents:\n{file_list}\n\n"
233
+ if processing_type == "combined":
234
+ return base + "Merge the content into a single coherent markdown document, preserving structure."
235
+ if processing_type == "individual":
236
+ return base + "Convert each document to markdown under its own heading."
237
+ if processing_type == "summary":
238
+ return base + "Create an EXECUTIVE SUMMARY followed by detailed markdown conversions per document."
239
+ if processing_type == "comparison":
240
+ return base + "Provide a comparison table of the documents, individual summaries, and cross-document insights."
241
+ # default fallback
242
+ return base
243
+
244
+ def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
245
+ names = original_filenames if original_filenames else [p.name for p in file_paths]
246
+ header = (
247
+ f"<!-- Multi-Document Processing Results -->\n"
248
+ f"<!-- Processing Type: {processing_type} -->\n"
249
+ f"<!-- Files Processed: {len(file_paths)} -->\n"
250
+ f"<!-- File Names: {', '.join(names)} -->\n\n"
251
+ )
252
+ # Ensure response_text is a string to avoid TypeError when it is None
253
+ safe_resp = "" if response_text is None else str(response_text)
254
+ return header + safe_resp
255
+
256
+ def _convert_batch_with_docling(self, paths: List[Path], ocr_method: Optional[str], **kwargs) -> List[str]:
257
+ """Run Docling conversion on a list of Paths and return markdown list."""
258
+ if self._check_cancellation():
259
+ raise DocumentProcessingError("Conversion cancelled")
260
+
261
+ # Select converter (respecting OCR method if set)
262
+ if ocr_method and ocr_method != "docling_default":
263
+ converter = self._create_converter_with_options(ocr_method, **kwargs)
264
+ else:
265
+ converter = self.converter
266
+
267
+ if converter is None:
268
+ raise DocumentProcessingError("Docling converter not initialized")
269
+
270
+ # Convert all docs
271
+ from docling.datamodel.base_models import ConversionStatus
272
+ markdown_results: List[str] = []
273
+ conv_results = converter.convert_all([str(p) for p in paths], raises_on_error=False)
274
+
275
+ for idx, conv_res in enumerate(conv_results):
276
+ if self._check_cancellation():
277
+ raise DocumentProcessingError("Conversion cancelled")
278
+
279
+ if conv_res.status in (ConversionStatus.SUCCESS, ConversionStatus.PARTIAL_SUCCESS):
280
+ markdown_results.append(conv_res.document.export_to_markdown())
281
+ else:
282
+ raise DocumentProcessingError(f"Docling failed to convert {paths[idx].name}")
283
+ return markdown_results
284
+
285
+ def parse_multiple(
286
+ self,
287
+ file_paths: List[Union[str, Path]],
288
+ processing_type: str = "combined",
289
+ original_filenames: Optional[List[str]] = None,
290
+ ocr_method: Optional[str] = None,
291
+ output_format: str = "markdown",
292
+ **kwargs,
293
+ ) -> str:
294
+ """Multi-document processing using Docling + optional Gemini summarisation/comparison."""
295
+ if not HAS_DOCLING:
296
+ raise ParserError("Docling package not installed")
297
+
298
+ paths = [Path(p) for p in file_paths]
299
+ self._validate_batch_files(paths)
300
+
301
+ # Run Docling conversion
302
+ markdown_list = self._convert_batch_with_docling(paths, ocr_method, **kwargs)
303
+
304
+ # LOCAL composition for combined/individual
305
+ if processing_type in ("combined", "individual"):
306
+ if processing_type == "individual":
307
+ names = original_filenames if original_filenames else [p.name for p in paths]
308
+ sections = [f"# Document {i}: {n}\n\n{md}" for i, (n, md) in enumerate(zip(names, markdown_list), 1)]
309
+ combined = "\n\n---\n\n".join(sections)
310
+ else:
311
+ combined = "\n\n---\n\n".join(markdown_list)
312
+ return self._format_batch_output(combined, paths, processing_type, original_filenames)
313
+
314
+ # SUMMARY / COMPARISON → Gemini 2.5 Flash
315
+ if not HAS_GEMINI or not config.api.google_api_key:
316
+ raise DocumentProcessingError("Gemini API not available for summary/comparison post-processing")
317
+
318
+ prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
319
+ combined_md = "\n\n---\n\n".join(markdown_list)
320
+
321
+ try:
322
+ client = genai.Client(api_key=config.api.google_api_key)
323
+ response = client.models.generate_content(
324
+ model=config.model.gemini_model,
325
+ contents=[prompt, combined_md],
326
+ config={
327
+ "temperature": config.model.temperature,
328
+ "top_p": 0.95,
329
+ "top_k": 40,
330
+ "max_output_tokens": config.model.max_tokens,
331
+ },
332
+ )
333
+ final_text = response.text if hasattr(response, "text") else None
334
+ if final_text is None:
335
+ raise DocumentProcessingError("Gemini post-processing returned no text")
336
+ except Exception as e:
337
+ raise DocumentProcessingError(f"Gemini post-processing failed: {str(e)}")
338
+
339
+ return self._format_batch_output(final_text, paths, processing_type, original_filenames)
340
+
341
 
342
  # Register the parser with the registry if available
343
  if HAS_DOCLING:
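
Both parsers prepend the same HTML-comment metadata header to the batch result. The sketch below shows that header as a standalone function, under the assumption that the display names are already resolved; `format_batch_output` is a hypothetical free-function counterpart of the `_format_batch_output` methods in this diff.

```python
from typing import List, Optional


def format_batch_output(response_text: Optional[str], names: List[str], processing_type: str) -> str:
    """Prefix the result with HTML comments describing the batch."""
    header = (
        "<!-- Multi-Document Processing Results -->\n"
        f"<!-- Processing Type: {processing_type} -->\n"
        f"<!-- Files Processed: {len(names)} -->\n"
        f"<!-- File Names: {', '.join(names)} -->\n\n"
    )
    # Guard against a missing response so the concatenation cannot raise TypeError.
    body = "" if response_text is None else str(response_text)
    return header + body
```

Because the header is made of HTML comments, it stays invisible when the Markdown is rendered but remains available to tools that read the raw output.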
src/parsers/mistral_ocr_parser.py CHANGED
@@ -357,7 +357,170 @@ class MistralOcrParser(DocumentParser):
357
 
358
  return markdown
359
 
360
 
361
 
362
  # Register the parser with the registry
363
  if MISTRAL_AVAILABLE:
 
357
 
358
  return markdown
359
 
360
+
361
+
362
+ def _validate_batch_files(self, file_paths: List[Path]) -> None:
363
+ """Validate batch of files for multi-document processing."""
364
+ if len(file_paths) == 0:
365
+ raise DocumentProcessingError("No files provided for processing")
366
+ if len(file_paths) > 5:
367
+ raise DocumentProcessingError("Maximum 5 files allowed for batch processing")
368
+
369
+ total_size = 0
370
+ for fp in file_paths:
371
+ if not fp.exists():
372
+ raise DocumentProcessingError(f"File not found: {fp}")
373
+ size = fp.stat().st_size
374
+ if size > 10 * 1024 * 1024:
375
+ raise DocumentProcessingError(f"Individual file size exceeds 10MB: {fp.name}")
376
+ total_size += size
377
+ if total_size > 20 * 1024 * 1024:
378
+ raise DocumentProcessingError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
379
+
380
+ # simple mime validation
381
+ for fp in file_paths:
382
+ if self._get_mime_type(fp.suffix.lower()) == "application/octet-stream":
383
+ raise DocumentProcessingError(f"Unsupported file type: {fp.name}")
384
+
385
+ def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
386
+ """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
387
+ ext = file_path.suffix.lower()
388
+ if ext == '.pdf':
389
+ # upload and get signed url
390
+ client = Mistral(api_key=config.api.mistral_api_key)
391
+ uploaded = client.files.upload(
392
+ file={
393
+ "file_name": file_path.name,
394
+ "content": file_path.read_bytes(),  # read into memory so no file handle is left open
395
+ },
396
+ purpose="ocr",
397
+ )
398
+ signed = client.files.get_signed_url(file_id=uploaded.id)
399
+ return {
400
+ "type": "document_url",
401
+ "document_url": signed.url,
402
+ }
403
+ else:
404
+ # encode image
405
+ b64 = self.encode_image(file_path)
406
+ mime = self._get_mime_type(ext)
407
+ return {
408
+ "type": "image_url",
409
+ "image_url": {
410
+ "url": f"data:{mime};base64,{b64}"
411
+ }
412
+ }
413
+
414
+ def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
415
+ if original_filenames:
416
+ names = original_filenames
417
+ else:
418
+ names = [fp.name for fp in file_paths]
419
+ file_list = "\n".join([f"- {name}" for name in names])
420
+ base = f"I will provide you with {len(file_paths)} documents.\n{file_list}\n\n"
421
+ if processing_type == "individual":
422
+ return base + "Please convert each document to markdown as its own section, preserving structure."
423
+ if processing_type == "summary":
424
+ return base + (
425
+ "Please first write an EXECUTIVE SUMMARY of all documents, then include converted markdown sections per document."
426
+ )
427
+ if processing_type == "comparison":
428
+ return base + (
429
+ "Please provide a comparison table of the documents, then individual summaries and cross-document insights."
430
+ )
431
+ # default combined
432
+ return base + "Please merge the content of all documents into a single cohesive markdown document."
433
+
434
+ def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
435
+ if original_filenames:
436
+ names = original_filenames
437
+ else:
438
+ names = [fp.name for fp in file_paths]
439
+ header = (
440
+ f"<!-- Multi-Document Processing Results -->\n"
441
+ f"<!-- Processing Type: {processing_type} -->\n"
442
+ f"<!-- Files Processed: {len(file_paths)} -->\n"
443
+ f"<!-- File Names: {', '.join(names)} -->\n\n"
444
+ )
445
+ return header + ("" if response_text is None else str(response_text))
446
+
447
+ def parse_multiple(
448
+ self,
449
+ file_paths: List[Union[str, Path]],
450
+ processing_type: str = "combined",
451
+ original_filenames: Optional[List[str]] = None,
452
+ ocr_method: Optional[str] = None,
453
+ output_format: str = "markdown",
454
+ **kwargs,
455
+ ) -> str:
456
+ """Parse multiple documents, supporting the same processing types as Gemini parser."""
457
+ if not MISTRAL_AVAILABLE:
458
+ raise DocumentProcessingError("Mistral client not installed. Install with 'pip install mistralai'.")
459
+ if not config.api.mistral_api_key:
460
+ raise DocumentProcessingError("MISTRAL_API_KEY not set.")
461
+
462
+ try:
463
+ # convert to Path objects
464
+ paths = [Path(p) for p in file_paths]
465
+ self._validate_batch_files(paths)
466
+
467
+ if self._check_cancellation():
468
+ return "Conversion cancelled."
469
+
470
+ use_understanding = ocr_method == "understand"
471
+ client = Mistral(api_key=config.api.mistral_api_key)
472
+
473
+ if use_understanding:
474
+ # Build chat content with document parts
475
+ prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
476
+ content_parts = [
477
+ {"type": "text", "text": prompt},
478
+ ]
479
+ for p in paths:
480
+ if self._check_cancellation():
481
+ return "Conversion cancelled."
482
+ content_parts.append(self._create_document_part(p))
483
+
484
+ chat_response = client.chat.complete(
485
+ model="mistral-large-latest",
486
+ max_tokens=config.model.max_tokens,
487
+ temperature=config.model.temperature,
488
+ messages=[{"role": "user", "content": content_parts}],
489
+ )
490
+ markdown_text = chat_response.choices[0].message.content
491
+ return self._format_batch_output(markdown_text, paths, processing_type, original_filenames)
492
+
493
+ # else basic OCR path
494
+ results = []
495
+ for idx, p in enumerate(paths):
496
+ if self._check_cancellation():
497
+ return "Conversion cancelled."
498
+ text = self._extract_with_ocr(client, p, p.suffix.lower())
499
+ if processing_type == "individual":
500
+ name = (original_filenames[idx] if original_filenames else p.name)
501
+ text = f"# Document {idx+1}: {name}\n\n" + text
502
+ results.append(text)
503
+
504
+ combined_md = "\n\n---\n\n".join(results) if processing_type in ["individual", "combined"] else "\n\n".join(results)
505
 
506
+ # For summary/comparison we now ask chat to summarise
507
+ if processing_type in ["summary", "comparison"]:
508
+ prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
509
+ chat_response = client.chat.complete(
510
+ model="mistral-large-latest",
511
+ max_tokens=config.model.max_tokens,
512
+ temperature=config.model.temperature,
513
+ messages=[
514
+ {"role": "user", "content": prompt + "\n\n" + combined_md}
515
+ ],
516
+ )
517
+ combined_md = chat_response.choices[0].message.content
518
+
519
+ return self._format_batch_output(combined_md, paths, processing_type, original_filenames)
520
+
521
+ except Exception as e:
522
+ logger.error(f"Error parsing multiple documents with Mistral OCR: {str(e)}")
523
+ raise DocumentProcessingError(f"Batch processing failed: {str(e)}")
524
 
525
  # Register the parser with the registry
526
  if MISTRAL_AVAILABLE:
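
For non-PDF inputs, `_create_document_part` embeds the image as a base64 data URL. The sketch below shows that encoding step in isolation, using only the standard library. It makes two assumptions: `image_part` is a hypothetical name, and `mimetypes.guess_type` stands in for the parser's own `_get_mime_type` and `encode_image` helpers, whose bodies are not shown in this diff.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Any, Dict


def image_part(path: Path) -> Dict[str, Any]:
    """Build an image_url content part carrying the file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Unsupported image type: {path.name}")
    # read_bytes() opens and closes the file, so no handle is left dangling.
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}"},
    }
```

Base64 adds roughly a third to the payload size, so the per-file size limit matters more for inline images than for uploaded PDFs.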