Ansemin101/Markit_v2

AnseMin committed on Jun 25, 2025

Commit d437733 (1 parent: 615c16c)

Enhance multi-document processing capabilities in parsers

- Implemented batch validation in both the Docling and Mistral OCR parsers: at most 5 files, 10MB per file, and 20MB combined. The Mistral OCR parser also rejects unsupported file types.
- Added support for multi-document processing in Docling, allowing up to 5 files with a combined size limit of 20MB.
- Added `_create_batch_prompt` and `_format_batch_output` methods to both parsers to handle multiple documents.
- Updated README to reflect new multi-document processing features and parser capabilities.
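
The limits named above can be sketched as a standalone check. This is a minimal illustration, not the project's code: `BatchValidationError` and `validate_batch` are hypothetical names standing in for `DocumentProcessingError` and `_validate_batch_files`, and the constants mirror the values that appear in the diff.

```python
from pathlib import Path
from typing import List

MAX_FILES = 5
MAX_FILE_BYTES = 10 * 1024 * 1024    # per-file limit used in the diff
MAX_TOTAL_BYTES = 20 * 1024 * 1024   # combined limit used in the diff


class BatchValidationError(ValueError):
    """Hypothetical stand-in for the project's DocumentProcessingError."""


def validate_batch(file_paths: List[Path]) -> int:
    """Check count and size limits; return the combined size in bytes."""
    if not file_paths:
        raise BatchValidationError("No files provided for processing")
    if len(file_paths) > MAX_FILES:
        raise BatchValidationError(f"Maximum {MAX_FILES} files allowed for batch processing")
    total = 0
    for fp in file_paths:
        if not fp.is_file():
            raise BatchValidationError(f"File not found: {fp}")
        size = fp.stat().st_size
        if size > MAX_FILE_BYTES:
            raise BatchValidationError(f"Individual file size exceeds 10MB: {fp.name}")
        total += size
    if total > MAX_TOTAL_BYTES:
        raise BatchValidationError("Combined file size exceeds 20MB limit")
    return total
```

Returning the combined size is a choice made for this sketch so callers can log it; the method in the diff returns `None`.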

Files changed (3)

README.md +10 -4
src/parsers/docling_parser.py +139 -0
src/parsers/mistral_ocr_parser.py +163 -0

README.md CHANGED

@@ -23,9 +23,10 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
-  - Docling: For advanced PDF understanding with table structure recognition
   - GOT-OCR: For image-based OCR with LaTeX support
   - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
 - **🆕 Intelligent Processing Types**:
   - **Combined**: Merge documents into unified content with duplicate removal
   - **Individual**: Separate sections per document with clear organization
@@ -61,14 +62,16 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
-**Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis.
 **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
 ## 🚀 Multi-Document Processing
 ### **What makes this special?**
-Markit v2 introduces **industry-leading multi-document processing** powered by Google's Gemini Flash 2.5, enabling intelligent analysis across multiple documents simultaneously.
 ### **Key Capabilities:**
 - **📊 Cross-Document Analysis**: Compare and contrast information across different files
@@ -181,7 +184,10 @@ The application uses centralized configuration management. You can enhance funct
    - **Individual**: Keep documents separate with clear section headers
    - **Summary**: Executive overview + detailed analysis of each document
    - **Comparison**: Side-by-side analysis with similarities/differences tables
-5. Choose your preferred parser (recommend **Gemini Flash** for best multi-document results)
 6. Click "Convert"
 7. Get intelligent cross-document analysis and download enhanced output

 - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
 - Multiple parser options:
   - MarkItDown: For comprehensive document conversion
+  - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
   - GOT-OCR: For image-based OCR with LaTeX support
   - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
+  - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
 - **🆕 Intelligent Processing Types**:
   - **Combined**: Merge documents into unified content with duplicate removal
   - **Individual**: Separate sections per document with clear organization
 **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
+**Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis. **Supports multi-document processing** with Gemini-powered summary & comparison.
 **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
+**Mistral OCR**: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode. **Supports multi-document processing**, with summary & comparison generated by Mistral's chat model (`mistral-large-latest`).
 ## 🚀 Multi-Document Processing
 ### **What makes this special?**
+Markit v2 introduces **industry-leading multi-document processing** with **three powerful parser options**: Gemini Flash (native multi-document AI), Mistral OCR (high-accuracy with Document Understanding), and Docling (advanced PDF analysis). All support intelligent cross-document analysis.
 ### **Key Capabilities:**
 - **📊 Cross-Document Analysis**: Compare and contrast information across different files
    - **Individual**: Keep documents separate with clear section headers
    - **Summary**: Executive overview + detailed analysis of each document
    - **Comparison**: Side-by-side analysis with similarities/differences tables
+5. Choose your preferred parser:
+   - **Gemini Flash**: Best for advanced cross-document reasoning and native multi-document support
+   - **Mistral OCR**: Great for high-accuracy OCR with Document Understanding mode
+   - **Docling**: Excellent for PDF table structure + multi-document analysis
 6. Click "Convert"
 7. Get intelligent cross-document analysis and download enhanced output
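
The four processing types listed in the README map naturally onto a small lookup table of instructions. The sketch below is one possible arrangement, assuming the instruction wording from the Docling parser in this commit. `INSTRUCTIONS` and `build_batch_prompt` are hypothetical names, and rejecting unknown types with `ValueError` is a choice made here, not the behaviour of either parser.

```python
from typing import Dict, List

# Instruction text per processing type (wording follows the Docling parser in this commit)
INSTRUCTIONS: Dict[str, str] = {
    "combined": "Merge the content into a single coherent markdown document, preserving structure.",
    "individual": "Convert each document to markdown under its own heading.",
    "summary": "Create an EXECUTIVE SUMMARY followed by detailed markdown conversions per document.",
    "comparison": "Provide a comparison table of the documents, individual summaries, and cross-document insights.",
}


def build_batch_prompt(names: List[str], processing_type: str) -> str:
    """Build the prompt text for a batch of documents."""
    if processing_type not in INSTRUCTIONS:
        # Assumption: fail loudly rather than silently falling back to a default
        raise ValueError(f"Unknown processing type: {processing_type!r}")
    file_list = "\n".join(f"- {name}" for name in names)
    base = f"I will provide you with {len(names)} documents:\n{file_list}\n\n"
    return base + INSTRUCTIONS[processing_type]
```

A table keeps the type names and their instructions in one place, which makes it harder for two parsers to drift apart on what the default should be.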

src/parsers/docling_parser.py CHANGED

@@ -8,6 +8,7 @@ import tempfile
 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
 from src.core.exceptions import DocumentProcessingError, ParserError
 # Check for Docling availability
 try:
@@ -20,6 +21,13 @@ except ImportError:
     HAS_DOCLING = False
     logging.warning("Docling package not installed. Please install with 'pip install docling'")
 # Configure logging
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
@@ -199,6 +207,137 @@ class DoclingParser(DocumentParser):
     def get_description(cls) -> str:
         return "Docling parser with advanced PDF understanding, table structure recognition, and multiple OCR engines"
 # Register the parser with the registry if available
 if HAS_DOCLING:

 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
 from src.core.exceptions import DocumentProcessingError, ParserError
+from src.core.config import config
 # Check for Docling availability
 try:
     HAS_DOCLING = False
     logging.warning("Docling package not installed. Please install with 'pip install docling'")
+# Gemini availability
+try:
+    from google import genai
+    HAS_GEMINI = True
+except ImportError:
+    HAS_GEMINI = False
 # Configure logging
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
     def get_description(cls) -> str:
         return "Docling parser with advanced PDF understanding, table structure recognition, and multiple OCR engines"
+    def _validate_batch_files(self, file_paths: List[Path]) -> None:
+        """Validate a batch of files (count, per-file size, combined size) for multi-document processing."""
+        if len(file_paths) == 0:
+            raise DocumentProcessingError("No files provided for processing")
+        if len(file_paths) > 5:
+            raise DocumentProcessingError("Maximum 5 files allowed for batch processing")
+        total_size = 0
+        for fp in file_paths:
+            if not fp.exists():
+                raise DocumentProcessingError(f"File not found: {fp}")
+            size = fp.stat().st_size
+            if size > 10 * 1024 * 1024:  # 10 MB
+                raise DocumentProcessingError(f"Individual file size exceeds 10MB: {fp.name}")
+            total_size += size
+        if total_size > 20 * 1024 * 1024:
+            raise DocumentProcessingError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
+    def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        """Create a natural-language prompt for Gemini post-processing."""
+        names = original_filenames if original_filenames else [p.name for p in file_paths]
+        file_list = "\n".join(f"- {n}" for n in names)
+        base = f"I will provide you with {len(file_paths)} documents:\n{file_list}\n\n"
+        if processing_type == "combined":
+            return base + "Merge the content into a single coherent markdown document, preserving structure."
+        if processing_type == "individual":
+            return base + "Convert each document to markdown under its own heading."
+        if processing_type == "summary":
+            return base + "Create an EXECUTIVE SUMMARY followed by detailed markdown conversions per document."
+        if processing_type == "comparison":
+            return base + "Provide a comparison table of the documents, individual summaries, and cross-document insights."
+        # default fallback
+        return base
+    def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        names = original_filenames if original_filenames else [p.name for p in file_paths]
+        header = (
+            f"<!-- Multi-Document Processing Results -->\n"
+            f"<!-- Processing Type: {processing_type} -->\n"
+            f"<!-- Files Processed: {len(file_paths)} -->\n"
+            f"<!-- File Names: {', '.join(names)} -->\n\n"
+        )
+        # Ensure response_text is a string to avoid TypeError when it is None
+        safe_resp = "" if response_text is None else str(response_text)
+        return header + safe_resp
+    def _convert_batch_with_docling(self, paths: List[Path], ocr_method: Optional[str], **kwargs) -> List[str]:
+        """Run Docling conversion on a list of Paths and return markdown list."""
+        if self._check_cancellation():
+            raise DocumentProcessingError("Conversion cancelled")
+        # Select converter (respecting OCR method if set)
+        if ocr_method and ocr_method != "docling_default":
+            converter = self._create_converter_with_options(ocr_method, **kwargs)
+        else:
+            converter = self.converter
+        if converter is None:
+            raise DocumentProcessingError("Docling converter not initialized")
+        # Convert all docs
+        from docling.datamodel.base_models import ConversionStatus
+        markdown_results: List[str] = []
+        conv_results = converter.convert_all([str(p) for p in paths], raises_on_error=False)
+        for idx, conv_res in enumerate(conv_results):
+            if self._check_cancellation():
+                raise DocumentProcessingError("Conversion cancelled")
+            if conv_res.status in (ConversionStatus.SUCCESS, ConversionStatus.PARTIAL_SUCCESS):
+                markdown_results.append(conv_res.document.export_to_markdown())
+            else:
+                raise DocumentProcessingError(f"Docling failed to convert {paths[idx].name}")
+        return markdown_results
+    def parse_multiple(
+        self,
+        file_paths: List[Union[str, Path]],
+        processing_type: str = "combined",
+        original_filenames: Optional[List[str]] = None,
+        ocr_method: Optional[str] = None,
+        output_format: str = "markdown",
+        **kwargs,
+    ) -> str:
+        """Multi-document processing using Docling + optional Gemini summarisation/comparison."""
+        if not HAS_DOCLING:
+            raise ParserError("Docling package not installed")
+        paths = [Path(p) for p in file_paths]
+        self._validate_batch_files(paths)
+        # Run Docling conversion
+        markdown_list = self._convert_batch_with_docling(paths, ocr_method, **kwargs)
+        # LOCAL composition for combined/individual
+        if processing_type in ("combined", "individual"):
+            if processing_type == "individual":
+                names = original_filenames if original_filenames else [p.name for p in paths]
+                sections = [f"# Document {i}: {n}\n\n{md}" for i, (n, md) in enumerate(zip(names, markdown_list), 1)]  # enumerate already starts at 1
+                combined = "\n\n---\n\n".join(sections)
+            else:
+                combined = "\n\n---\n\n".join(markdown_list)
+            return self._format_batch_output(combined, paths, processing_type, original_filenames)
+        # SUMMARY / COMPARISON → Gemini 2.5 Flash
+        if not HAS_GEMINI or not config.api.google_api_key:
+            raise DocumentProcessingError("Gemini API not available for summary/comparison post-processing")
+        prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
+        combined_md = "\n\n---\n\n".join(markdown_list)
+        try:
+            client = genai.Client(api_key=config.api.google_api_key)
+            response = client.models.generate_content(
+                model=config.model.gemini_model,
+                contents=[prompt, combined_md],
+                config={
+                    "temperature": config.model.temperature,
+                    "top_p": 0.95,
+                    "top_k": 40,
+                    "max_output_tokens": config.model.max_tokens,
+                },
+            )
+            final_text = response.text if hasattr(response, "text") else None
+            if final_text is None:
+                raise DocumentProcessingError("Gemini post-processing returned no text")
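+        except DocumentProcessingError:
+            # Re-raise our own error unchanged instead of wrapping it a second time
+            raise
+        except Exception as e:
+            raise DocumentProcessingError(f"Gemini post-processing failed: {e}") from e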
+        return self._format_batch_output(final_text, paths, processing_type, original_filenames)
 # Register the parser with the registry if available
 if HAS_DOCLING:
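
The local composition for the `individual` type and the HTML-comment header can be illustrated in isolation. This is a minimal sketch with hypothetical function names (`format_individual`, `with_batch_header`); it numbers sections from 1 by passing `start=1` to `enumerate` and using the index directly, and it guards against a `None` body in the same way as `_format_batch_output` in the Docling parser.

```python
from typing import List, Optional

SEPARATOR = "\n\n---\n\n"


def format_individual(names: List[str], markdown_list: List[str]) -> str:
    """Give each document its own numbered heading, separated by horizontal rules."""
    sections = [
        f"# Document {i}: {name}\n\n{md}"
        for i, (name, md) in enumerate(zip(names, markdown_list), start=1)
    ]
    return SEPARATOR.join(sections)


def with_batch_header(body: Optional[str], names: List[str], processing_type: str) -> str:
    """Prefix the result with the metadata comments used for batch output."""
    header = (
        "<!-- Multi-Document Processing Results -->\n"
        f"<!-- Processing Type: {processing_type} -->\n"
        f"<!-- Files Processed: {len(names)} -->\n"
        f"<!-- File Names: {', '.join(names)} -->\n\n"
    )
    # Treat a missing body as empty text so concatenation cannot raise TypeError
    return header + ("" if body is None else str(body))
```

For example, `with_batch_header(format_individual(names, docs), names, "individual")` yields the header followed by one section per document.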

src/parsers/mistral_ocr_parser.py CHANGED

@@ -357,7 +357,170 @@ class MistralOcrParser(DocumentParser):
         return markdown
 # Register the parser with the registry
 if MISTRAL_AVAILABLE:

         return markdown
+    def _validate_batch_files(self, file_paths: List[Path]) -> None:
+        """Validate batch of files for multi-document processing."""
+        if len(file_paths) == 0:
+            raise DocumentProcessingError("No files provided for processing")
+        if len(file_paths) > 5:
+            raise DocumentProcessingError("Maximum 5 files allowed for batch processing")
+        total_size = 0
+        for fp in file_paths:
+            if not fp.exists():
+                raise DocumentProcessingError(f"File not found: {fp}")
+            size = fp.stat().st_size
+            if size > 10 * 1024 * 1024:
+                raise DocumentProcessingError(f"Individual file size exceeds 10MB: {fp.name}")
+            total_size += size
+        if total_size > 20 * 1024 * 1024:
+            raise DocumentProcessingError(f"Combined file size ({total_size / (1024*1024):.1f}MB) exceeds 20MB limit")
+        # simple mime validation
+        for fp in file_paths:
+            if self._get_mime_type(fp.suffix.lower()) == "application/octet-stream":
+                raise DocumentProcessingError(f"Unsupported file type: {fp.name}")
+    def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
+        """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
+        ext = file_path.suffix.lower()
+        if ext == '.pdf':
+            # upload and get signed url
+            client = Mistral(api_key=config.api.mistral_api_key)
+            uploaded = client.files.upload(
+                file={
+                    "file_name": file_path.name,
+                    "content": file_path.read_bytes(),  # read eagerly so no file handle is left open
+                },
+                purpose="ocr",
+            )
+            signed = client.files.get_signed_url(file_id=uploaded.id)
+            return {
+                "type": "document_url",
+                "document_url": signed.url,
+            }
+        else:
+            # encode image
+            b64 = self.encode_image(file_path)
+            mime = self._get_mime_type(ext)
+            return {
+                "type": "image_url",
+                "image_url": {
+                    "url": f"data:{mime};base64,{b64}"
+                }
+            }
+    def _create_batch_prompt(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        if original_filenames:
+            names = original_filenames
+        else:
+            names = [fp.name for fp in file_paths]
+        file_list = "\n".join([f"- {name}" for name in names])
+        base = f"I will provide you with {len(file_paths)} documents.\n{file_list}\n\n"
+        if processing_type == "individual":
+            return base + "Please convert each document to markdown as its own section, preserving structure."
+        if processing_type == "summary":
+            return base + (
+                "Please first write an EXECUTIVE SUMMARY of all documents, then include converted markdown sections per document."
+            )
+        if processing_type == "comparison":
+            return base + (
+                "Please provide a comparison table of the documents, then individual summaries and cross-document insights."
+            )
+        # default combined
+        return base + "Please merge the content of all documents into a single cohesive markdown document."
+    def _format_batch_output(self, response_text: str, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> str:
+        if original_filenames:
+            names = original_filenames
+        else:
+            names = [fp.name for fp in file_paths]
+        header = (
+            f"<!-- Multi-Document Processing Results -->\n"
+            f"<!-- Processing Type: {processing_type} -->\n"
+            f"<!-- Files Processed: {len(file_paths)} -->\n"
+            f"<!-- File Names: {', '.join(names)} -->\n\n"
+        )
+        return header + ("" if response_text is None else str(response_text))  # guard against a None response
+    def parse_multiple(
+        self,
+        file_paths: List[Union[str, Path]],
+        processing_type: str = "combined",
+        original_filenames: Optional[List[str]] = None,
+        ocr_method: Optional[str] = None,
+        output_format: str = "markdown",
+        **kwargs,
+    ) -> str:
+        """Parse multiple documents, supporting the same processing types as Gemini parser."""
+        if not MISTRAL_AVAILABLE:
+            raise DocumentProcessingError("Mistral client not installed. Install with 'pip install mistralai'.")
+        if not config.api.mistral_api_key:
+            raise DocumentProcessingError("MISTRAL_API_KEY not set.")
+        try:
+            # convert to Path objects
+            paths = [Path(p) for p in file_paths]
+            self._validate_batch_files(paths)
+            if self._check_cancellation():
+                return "Conversion cancelled."
+            use_understanding = ocr_method == "understand"
+            client = Mistral(api_key=config.api.mistral_api_key)
+            if use_understanding:
+                # Build chat content with document parts
+                prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
+                content_parts = [
+                    {"type": "text", "text": prompt},
+                ]
+                for p in paths:
+                    if self._check_cancellation():
+                        return "Conversion cancelled."
+                    content_parts.append(self._create_document_part(p))
+                chat_response = client.chat.complete(
+                    model="mistral-large-latest",
+                    max_tokens=config.model.max_tokens,
+                    temperature=config.model.temperature,
+                    messages=[{"role": "user", "content": content_parts}],
+                )
+                markdown_text = chat_response.choices[0].message.content
+                return self._format_batch_output(markdown_text, paths, processing_type, original_filenames)
+            # else basic OCR path
+            results = []
+            for idx, p in enumerate(paths):
+                if self._check_cancellation():
+                    return "Conversion cancelled."
+                text = self._extract_with_ocr(client, p, p.suffix.lower())
+                if processing_type == "individual":
+                    name = (original_filenames[idx] if original_filenames else p.name)
+                    text = f"# Document {idx+1}: {name}\n\n" + text
+                results.append(text)
+            combined_md = "\n\n---\n\n".join(results) if processing_type in ["individual", "combined"] else "\n\n".join(results)
+            # For summary/comparison we now ask chat to summarise
+            if processing_type in ["summary", "comparison"]:
+                prompt = self._create_batch_prompt(paths, processing_type, original_filenames)
+                chat_response = client.chat.complete(
+                    model="mistral-large-latest",
+                    max_tokens=config.model.max_tokens,
+                    temperature=config.model.temperature,
+                    messages=[
+                        {"role": "user", "content": prompt + "\n\n" + combined_md}
+                    ],
+                )
+                combined_md = chat_response.choices[0].message.content
+            return self._format_batch_output(combined_md, paths, processing_type, original_filenames)
+        except Exception as e:
+            logger.error(f"Error parsing multiple documents with Mistral OCR: {str(e)}")
+            raise DocumentProcessingError(f"Batch processing failed: {str(e)}")
 # Register the parser with the registry
 if MISTRAL_AVAILABLE:
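
The MIME check and the image part built in `_create_document_part` rely on `_get_mime_type` and `encode_image`, neither of which is shown in this diff. The sketch below shows how such helpers could fit together, assuming a small extension table. `MIME_BY_EXT`, `get_mime_type`, `is_supported`, and `image_part` are hypothetical names, and the real helpers may cover more formats.

```python
import base64
from pathlib import Path
from typing import Any, Dict

# Hypothetical extension table; the project's _get_mime_type may differ
MIME_BY_EXT = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}
FALLBACK_MIME = "application/octet-stream"


def get_mime_type(ext: str) -> str:
    """Map a file extension to a MIME type, falling back to a generic binary type."""
    return MIME_BY_EXT.get(ext.lower(), FALLBACK_MIME)


def is_supported(path: Path) -> bool:
    """A file is treated as unsupported when only the fallback type matches."""
    return get_mime_type(path.suffix) != FALLBACK_MIME


def image_part(data: bytes, ext: str) -> Dict[str, Any]:
    """Build an image_url content part carrying the image as a base64 data URL."""
    b64 = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{get_mime_type(ext)};base64,{b64}"},
    }
```

Using the fallback type as the "unsupported" signal matches the check in `_validate_batch_files`, which rejects any file whose type resolves to `application/octet-stream`.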