mcp_ocr_json

Vachudev committed on Dec 4, 2025

Commit 11b7950 (verified) · 1 Parent(s): 041c268

update pdf preprocessing

Files changed (1)

ocr_preprocessing_engine.py +20 -219

@@ -1,11 +1,3 @@
-Based on the "Make OCR Actually Work" video, the failure of OCR is often due to skipping four specific preprocessing steps: **Normalization, Denoising, Deskewing, and Thresholding**. The video demonstrates that even advanced Transformer models fail if an image is rotated or has poor contrast.
-Here is the updated modular pipeline. I have rewritten `ocr_preprocessing_engine.py` to strictly implement the 4-step workflow highlighted in the video, and refined the `prompts.py` to take advantage of the cleaner text output.
-### 1. Improved `ocr_preprocessing_engine.py`
-**Changes:** Added explicit **Normalization** (contrast stretching) and **Denoising** steps before Binarization, as emphasized in the video source.
-```python
 import cv2
 import numpy as np
 import pytesseract
@@ -16,77 +8,72 @@ import logging
 logger = logging.getLogger("ocr_preprocessor")
-def preprocess_image(image: Image.Image) -> Image.Image:
     """
-    Implements the 4-step pipeline from the 'Make OCR Work' video source:
     1. Normalization (Contrast Stretching)
-    2. Denoising
-    3. Deskewing
     4. Thresholding (Binarization)
     """
     # Convert PIL to OpenCV format
     img_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
-    # 1. Normalization: Stretch pixel intensity to 0-255 range
-    # This fixes images that look "washed out" or "completely black" due to bad contrast.
-    norm_img = np.zeros((img_cv.shape, img_cv.shape), dtype=np.uint8)
     img_cv = cv2.normalize(img_cv, norm_img, 0, 255, cv2.NORM_MINMAX)
-    # Convert to Grayscale for further processing
     gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
-    # 2. Denoising: Remove speckles/artifacts
-    # fastNlMeans is effective but slow; using GaussianBlur as a faster CPU-friendly alternative
     denoised = cv2.GaussianBlur(gray, (5, 5), 0)
-    # 3. Thresholding (Binarization)
-    # The video suggests finding the right value. Otsu's method (THRESH_OTSU) automatically
-    # finds the optimal threshold value to separate text (foreground) from background.
     _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
-    # 4. Deskewing
-    # The video notes that without rotation correction, OCR often returns nothing.
     coords = np.column_stack(np.where(binary > 0))
     angle = cv2.minAreaRect(coords)[-1]
-    # Adjust angle convention for OpenCV
     if angle < -45:
         angle = -(90 + angle)
     else:
         angle = -angle
-    # Rotate only if the skew is noticeable (>0.5 degrees) to avoid interpolation artifacts
     if abs(angle) > 0.5:
         (h, w) = binary.shape[:2]
         center = (w // 2, h // 2)
         M = cv2.getRotationMatrix2D(center, angle, 1.0)
         binary = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
-        logger.info(f"Image deskewed by {angle:.2f} degrees.")
     return Image.fromarray(binary)
 def extract_text_with_preprocessing(file_path: str) -> str:
     """
-    Pipeline: Load -> High-DPI Convert -> 4-Step Preprocess -> Tesseract -> Text
     """
     if not os.path.exists(file_path):
         return ""
     text_content = ""
     try:
-        # Load PDF at 300 DPI - Essential for Tesseract accuracy
         if file_path.lower().endswith('.pdf'):
             images = convert_from_path(file_path, dpi=300)
         else:
             images = [Image.open(file_path)]
         for i, img in enumerate(images):
-            # Apply the 4-step video pipeline
-            processed_img = preprocess_image(img)
             # Tesseract Config:
-            # --psm 4: Assume variable size text (good for invoices)
-            # preserve_interword_spaces: Helps extraction of table columns
             custom_config = r'--oem 3 --psm 4 -c preserve_interword_spaces=1'
             page_text = pytesseract.image_to_string(processed_img, config=custom_config)
@@ -96,190 +83,4 @@ def extract_text_with_preprocessing(file_path: str) -> str:
         logger.error(f"Preprocessing/OCR Error: {e}")
         return f"Error processing file: {str(e)}"
-    return text_content.strip()
-```
-### 2. Refined `prompts.py`
-**Changes:** Since preprocessing (deskewing/normalization) yields cleaner text, we can be stricter in the SOP. I have updated the System Prompt to explicitly map the "Golden Sample" logic to the output.
-```python
-def get_ocr_extraction_prompt(raw_text: str) -> str:
-    """
-    Returns a strict prompt with SOP and One-Shot example.
-    Refined to handle 'Line Items' specifically as preprocessing makes tables more readable.
-    """
-    return f"""<|im_start|>system
-You are a precise Invoice Data Extraction Agent.
-Your input is raw OCR text from a pre-processed invoice image.
-### STANDARD OPERATING PROCEDURE (SOP):
-1. **Header Extraction**: Identify the Vendor Name, Invoice Number, and Dates (Invoice & Due).
-2. **Table Parsing**: The OCR preserves inter-word spacing. Use this to identify the 'Line Items' table.
-3. **Normalization**:
-   - Dates must be YYYY-MM-DD.
-   - Amounts must be floats (no currency symbols).
-4. **Validation**: If 'Total Amount' is missing, calculate it from line items if possible.
-5. **Output Format**: Return ONLY valid JSON. No Markdown block markers (```json).
-### ONE-SHOT EXAMPLE (City of Auburn Invoice):
-**Input OCR**:
-"CITY OF AUBURN... 076248-000... Due: 01/07/25...
-Water Total $649.69... Sewer Total $1,333.45... Total New Charges $2,363.39"
-**Correct JSON**:
-{{
-    "invoice_number": "076248-000",
-    "vendor_name": "City of Auburn",
-    "invoice_date": "2024-12-18",
-    "due_date": "2025-01-07",
-    "total_amount": 2363.39,
-    "line_items": [
-        {{"description": "Water Total", "quantity": 1, "rate": 649.69, "amount": 649.69}},
-        {{"description": "Sewer Total", "quantity": 1, "rate": 1333.45, "amount": 1333.45}}
-    ]
-}}
-<|im_end|>
-<|im_start|>user
-### TARGET INVOICE OCR DATA:
-{raw_text[:4000]}
-Return the JSON:
-<|im_end|>
-<|im_start|>assistant
-{{
-""" # Pre-fill brace to force Qwen into JSON mode
-```
-### 3. `config.py` (Modular Configuration)
-**Changes:** Decouples the "Canonical Schema" (from `prompts.py`) from Zoho's specific API keys. This allows you to edit the field mapping without changing the AI logic.
-```python
-# config.py
-# --- Zoho API Configuration ---
-# DO NOT COMMIT REAL CREDENTIALS TO GIT
-CLIENT_ID = os.getenv("ZOHO_CLIENT_ID", "YOUR_CLIENT_ID")
-CLIENT_SECRET = os.getenv("ZOHO_CLIENT_SECRET", "YOUR_CLIENT_SECRET")
-REFRESH_TOKEN = os.getenv("ZOHO_REFRESH_TOKEN", "YOUR_REFRESH_TOKEN")
-API_BASE = "https://www.zohoapis.in/crm/v2"
-# --- Schema Mapper ---
-# Maps LLM 'Canonical' keys -> Zoho CRM/Books API keys
-# If you switch CRM, you only change the right-hand side.
-ZOHO_INVOICE_MAP = {
-    "invoice_number": "invoice_number",  # Canonical : Zoho
-    "vendor_name": "customer_name",      # Note: Requires customer ID lookup in logic
-    "invoice_date": "date",
-    "due_date": "due_date",
-    "total_amount": "total",
-    "line_items": "line_items"
-}
-ZOHO_LINE_ITEM_MAP = {
-    "description": "name",
-    "quantity": "quantity",
-    "rate": "rate",
-    "amount": "item_total"
-}
-```
-### 4. `app.py` (Pipeline Integration)
-**Changes:** Integrates the new `ocr_preprocessing_engine`, applies the mapping from `config.py`, and maintains the FastMCP structure.
-```python
-from mcp.server.fastmcp import FastMCP
-import logging
-import json
-import os
-from ocr_preprocessing_engine import extract_text_with_preprocessing
-from prompts import get_ocr_extraction_prompt
-from config import ZOHO_INVOICE_MAP, ZOHO_LINE_ITEM_MAP
-# Initialize FastMCP
-mcp = FastMCP("ZohoInvoiceAgent")
-logger = logging.getLogger("mcp_server")
-def map_canonical_to_zoho(canonical_data: dict) -> dict:
-    """
-    Transforms generic LLM JSON into Zoho-ready JSON using config maps.
-    """
-    zoho_payload = {}
-    # 1. Map Top-Level Fields
-    for llm_key, zoho_key in ZOHO_INVOICE_MAP.items():
-        if llm_key in canonical_data and llm_key != "line_items":
-            zoho_payload[zoho_key] = canonical_data[llm_key]
-    # 2. Map Line Items
-    if "line_items" in canonical_data and isinstance(canonical_data["line_items"], list):
-        zoho_items = []
-        for item in canonical_data["line_items"]:
-            new_item = {}
-            for l_key, z_key in ZOHO_LINE_ITEM_MAP.items():
-                if l_key in item:
-                    new_item[z_key] = item[l_key]
-            # Zoho API often requires quantity default to 1 if missing
-            if "quantity" not in new_item:
-                new_item["quantity"] = 1
-            zoho_items.append(new_item)
-        zoho_payload["line_items"] = zoho_items
-    return zoho_payload
-@mcp.tool()
-def process_invoice_document(file_path: str) -> dict:
-    """
-    MCP Tool: Takes an invoice PDF/Image, runs strict preprocessing (Normalize->Deskew->Threshold),
-    extracts data via Qwen 2.5, and maps it to Zoho API format.
-    """
-    if not os.path.exists(file_path):
-        return {"error": "File not found"}
-    # Step 1: Enhanced OCR Preprocessing
-    # This step is critical to fix rotation and contrast issues before Tesseract runs.
-    raw_text = extract_text_with_preprocessing(file_path)
-    if len(raw_text) < 50:
-        return {"error": "OCR failed. Image may be too blurry or blank."}
-    # Step 2: LLM Extraction (Qwen 2.5)
-    prompt = get_ocr_extraction_prompt(raw_text)
-    # Mocking local_llm_generate for this snippet - ensure this connects to your Qwen pipeline
-    # Ensure do_sample=False (Greedy Decoding) to reduce erratic json
-    # response = local_llm_generate(prompt, max_tokens=500, do_sample=False)
-    # --- SIMULATED RESPONSE FOR DEMO ---
-    # In production, replace this with actual model generation
-    logger.info("Sending text to LLM...")
-    # -----------------------------------
-    try:
-        # Assuming response["text"] contains the JSON
-        # Here we pretend the LLM returned the canonical JSON structure
-        # canonical_data = json.loads("{" + response["text"])
-        # For demonstration, let's assume valid extraction:
-        canonical_data = {
-            "invoice_number": "INV-001",
-            "total_amount": 100.00,
-            "line_items": [{"description": "Service", "rate": 100.00}]
-        }
-        # Step 3: Map to Zoho Structure
-        zoho_ready_data = map_canonical_to_zoho(canonical_data)
-        return {
-            "status": "success",
-            "source_file": os.path.basename(file_path),
-            "canonical_data": canonical_data, # Useful for debugging/user verification
-            "zoho_payload": zoho_ready_data # Ready for the create_invoice tool
-        }
-    except Exception as e:
-        return {"error": f"Processing failed: {str(e)}"}
-if __name__ == "__main__":
-    mcp.run()
-```
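
Both sides of this diff rely on `cv2.THRESH_OTSU` to pick the binarization threshold. As a rough illustration of what that flag computes, here is a dependency-free sketch of Otsu's method on a flat list of 8-bit grayscale values. The names `otsu_threshold` and `binarize` are hypothetical helpers, not part of this commit or of OpenCV, and OpenCV's own implementation may break ties differently.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the threshold t that maximizes between-class variance.

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of those pixel values
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue  # no pixels in the low class yet
        if w_high == 0:
            break     # every pixel is already in the low class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: Sequence[int], t: int) -> List[int]:
    """Mimic THRESH_BINARY: values above t become 255, the rest 0."""
    return [255 if p > t else 0 for p in pixels]


# Hypothetical page: 30 dark "ink" pixels and 70 light "paper" pixels.
page = [20] * 30 + [230] * 70
t = otsu_threshold(page)
mask = binarize(page, t)
```

With dark text on light paper, this kind of binarization leaves the paper at 255 and the ink at 0, which matters for any later step that selects pixels with `binary > 0`.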

 import cv2
 import numpy as np
 import pytesseract
 logger = logging.getLogger("ocr_preprocessor")
+def preprocess_image_for_ocr(image: Image.Image) -> Image.Image:
     """
+    Applies the 4-step OCR enhancement pipeline (Source: Make OCR Actually Work):
     1. Normalization (Contrast Stretching)
+    2. Denoising (Gaussian Blur)
+    3. Deskewing (Rotation Correction)
     4. Thresholding (Binarization)
     """
     # Convert PIL to OpenCV format
     img_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
+    # 1. Normalization: Maximize contrast range
+    norm_img = np.zeros((img_cv.shape[0], img_cv.shape[1]), dtype=np.uint8)
     img_cv = cv2.normalize(img_cv, norm_img, 0, 255, cv2.NORM_MINMAX)
+    # Convert to Grayscale
     gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
+    # 2. Denoising: Remove scanning artifacts
     denoised = cv2.GaussianBlur(gray, (5, 5), 0)
+    # 3. Thresholding (Binarization): Otsu's method picks one global threshold automatically
+    # to separate text (foreground) from the background
     _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    # 4. Deskewing: Fix rotation
     coords = np.column_stack(np.where(binary > 0))
     angle = cv2.minAreaRect(coords)[-1]
+    # Map minAreaRect's angle (assumed to lie in [-90, 0)) to the rotation that corrects the skew
     if angle < -45:
         angle = -(90 + angle)
     else:
         angle = -angle
+    # Rotate only if skew is significant (>0.5 degrees)
     if abs(angle) > 0.5:
         (h, w) = binary.shape[:2]
         center = (w // 2, h // 2)
         M = cv2.getRotationMatrix2D(center, angle, 1.0)
         binary = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
     return Image.fromarray(binary)
 def extract_text_with_preprocessing(file_path: str) -> str:
     """
+    Converts a PDF to 300 DPI images (or opens a single image), preprocesses each page,
+    and runs Tesseract with layout preservation.
     """
     if not os.path.exists(file_path):
         return ""
     text_content = ""
     try:
+        # Load PDF at 300 DPI (Tesseract optimal standard)
         if file_path.lower().endswith('.pdf'):
             images = convert_from_path(file_path, dpi=300)
         else:
             images = [Image.open(file_path)]
         for i, img in enumerate(images):
+            processed_img = preprocess_image_for_ocr(img)
             # Tesseract Config:
+            # --psm 4: Assume a single column of text of variable sizes (suits invoice layouts)
+            # preserve_interword_spaces=1: Helps LLM detect table columns
             custom_config = r'--oem 3 --psm 4 -c preserve_interword_spaces=1'
             page_text = pytesseract.image_to_string(processed_img, config=custom_config)
         logger.error(f"Preprocessing/OCR Error: {e}")
         return f"Error processing file: {str(e)}"
+    return text_content.strip()
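
The deskew branch above assumes `cv2.minAreaRect` reports its angle in the range [-90, 0). Newer OpenCV releases (reportedly 4.5.1 and later) return a value in (0, 90] instead, in which case the `angle < -45` branch never runs. The following is a minimal sketch, not the author's method, of a fold that gives the same result for either range, assuming the newer range is simply the older one shifted by 90 degrees. `legacy_correction` restates the branch from the diff for comparison, and `correction_angle` is a hypothetical helper; the sign convention should still be checked against a test image with a known tilt.

```python
def legacy_correction(angle: float) -> float:
    """The branch logic from the diff, intended for angles in [-90, 0)."""
    if angle < -45:
        return -(90 + angle)
    return -angle


def correction_angle(angle: float) -> float:
    """Fold any rectangle angle to [-45, 45) and return the rotation to apply.

    Assumption (not confirmed by the source): angles that differ by a
    multiple of 90 degrees describe the same rectangle orientation.
    """
    folded = angle % 90.0   # now in [0, 90)
    if folded >= 45.0:
        folded -= 90.0      # now in [-45, 45)
    return -folded


# An angle of -5 in the older convention and 85 in the newer one
# are treated as the same skew.
old_style = correction_angle(-5.0)
new_style = correction_angle(85.0)
```

The existing `abs(angle) > 0.5` guard can be applied to the returned value unchanged. Separately, for dark text on light paper, `THRESH_BINARY` leaves the paper white, so `np.where(binary > 0)` collects background pixels; skew estimation from text pixels would typically use an inverted mask.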