mcp_ocr_json

Vachudev committed on Dec 4, 2025

Commit

8850f3e

verified ·

1 Parent(s): 9c8d087

adding preprocessing for pdfs

preprocess pdfs to achieve better OCR.

Files changed (1)

ocr_preprocessing_engine.py +285 -0

ocr_preprocessing_engine.py ADDED

	@@ -0,0 +1,285 @@

+According to the "Make OCR Actually Work" video, OCR often fails because four preprocessing steps are skipped: **Normalization, Denoising, Deskewing, and Thresholding**. The video demonstrates that even advanced Transformer models can fail when an image is rotated or has poor contrast.
+Here is the updated modular pipeline. I have rewritten `ocr_preprocessing_engine.py` to implement the four-step workflow highlighted in the video, and refined `prompts.py` to take advantage of the cleaner text output.
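To make the "poor contrast" failure mode concrete, here is a minimal pure-Python sketch of min-max contrast stretching, the idea behind the normalization step. The function name `stretch_contrast` and the sample values are illustrative assumptions, not part of the commit; the pipeline itself relies on OpenCV for this.

```python
def stretch_contrast(pixels: list[int]) -> list[int]:
    """Linearly rescale grayscale values so the darkest maps to 0 and the brightest to 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    scale = 255 / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]


# A "washed out" strip that only uses the 100-140 part of the 0-255 range.
washed_out = [100, 110, 120, 130, 140]
stretched = stretch_contrast(washed_out)
print(min(stretched), max(stretched))  # the full 0-255 range is now used
```

After stretching, the gap between ink and paper is as wide as the 8-bit range allows, which gives the later thresholding step more room to separate the two.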
+### 1. Improved `ocr_preprocessing_engine.py`
+**Changes:** Added explicit **Normalization** (contrast stretching) and **Denoising** steps before Binarization, as emphasized in the video source.
+```python
+import cv2
+import numpy as np
+import pytesseract
+from pdf2image import convert_from_path
+from PIL import Image
+import os
+import logging
+logger = logging.getLogger("ocr_preprocessor")
+def preprocess_image(image: Image.Image) -> Image.Image:
+    """
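+    Implements the 4-step pipeline from the 'Make OCR Actually Work' video source.
+    Steps run in this order (thresholding comes before deskewing because the
+    skew angle is estimated from the binarized image):
+    1. Normalization (Contrast Stretching)
+    2. Denoising
+    3. Thresholding (Binarization)
+    4. Deskewing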
+    """
+    # Convert PIL to OpenCV format (force RGB so grayscale or RGBA inputs also work)
+    img_cv = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2BGR)
+    # 1. Normalization: Stretch pixel intensity to 0-255 range
+    # This fixes images that look "washed out" or "completely black" due to bad contrast.
+    # Passing None as dst lets OpenCV allocate the output array itself.
+    img_cv = cv2.normalize(img_cv, None, 0, 255, cv2.NORM_MINMAX)
+    # Convert to Grayscale for further processing
+    gray = cv2.cvtColor(img_cv, cv2.COLOR_BGR2GRAY)
+    # 2. Denoising: Remove speckles/artifacts
+    # cv2.fastNlMeansDenoising is effective but slow; GaussianBlur is used here as a faster, CPU-friendly alternative
+    denoised = cv2.GaussianBlur(gray, (5, 5), 0)
+    # 3. Thresholding (Binarization)
+    # The video suggests finding the right value. Otsu's method (THRESH_OTSU) automatically
+    # finds the optimal threshold value to separate text (foreground) from background.
+    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
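+    # 4. Deskewing
+    # The video notes that without rotation correction, OCR often returns nothing.
+    # After THRESH_BINARY, dark text is 0 and the light background is 255, so the
+    # text pixels are the zeros (this assumes dark text on a light page).
+    coords = np.column_stack(np.where(binary == 0))
+    if len(coords) == 0:
+        return Image.fromarray(binary)  # Blank page: nothing to deskew
+    angle = cv2.minAreaRect(coords)[-1]
+    # Adjust angle convention for OpenCV.
+    # NOTE: this branch assumes the older [-90, 0) angle range; newer OpenCV
+    # releases may report (0, 90], so verify against your installed version.
+    if angle < -45:
+        angle = -(90 + angle)
+    else:
+        angle = -angle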
+    # Rotate only if the skew is noticeable (>0.5 degrees) to avoid interpolation artifacts
+    if abs(angle) > 0.5:
+        (h, w) = binary.shape[:2]
+        center = (w // 2, h // 2)
+        M = cv2.getRotationMatrix2D(center, angle, 1.0)
+        binary = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
+        logger.info(f"Image deskewed by {angle:.2f} degrees.")
+    return Image.fromarray(binary)
+def extract_text_with_preprocessing(file_path: str) -> str:
+    """
+    Pipeline: Load -> High-DPI Convert -> 4-Step Preprocess -> Tesseract -> Text
+    """
+    if not os.path.exists(file_path):
+        return ""
+    text_content = ""
+    try:
+        # Load PDF at 300 DPI - Essential for Tesseract accuracy
+        if file_path.lower().endswith('.pdf'):
+            images = convert_from_path(file_path, dpi=300)
+        else:
+            images = [Image.open(file_path)]
+        for i, img in enumerate(images):
+            # Apply the 4-step video pipeline
+            processed_img = preprocess_image(img)
+            # Tesseract Config:
+            # --psm 4: Assume a single column of text of variable sizes (good for invoices)
+            # preserve_interword_spaces: Helps extraction of table columns
+            custom_config = r'--oem 3 --psm 4 -c preserve_interword_spaces=1'
+            page_text = pytesseract.image_to_string(processed_img, config=custom_config)
+            text_content += f"--- Page {i+1} ---\n{page_text}\n"
+    except Exception as e:
+        logger.error(f"Preprocessing/OCR Error: {e}")
+        return f"Error processing file: {str(e)}"
+    return text_content.strip()
+```
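The thresholding step above leans on Otsu's method to pick the cut-off automatically. As a rough illustration of what that means, the sketch below computes an Otsu threshold in pure Python by maximizing the between-class variance over a 256-bin histogram. The function name `otsu_threshold` and the synthetic pixel values are assumptions for illustration; OpenCV's own implementation may differ in details such as tie-breaking.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form one class and pixels > t form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low, sum_low = 0, 0  # pixel count and intensity sum of the lower class
    for t in range(256):
        w_low += hist[t]
        if w_low == 0:
            continue  # no pixels at or below t yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is in the lower class; nothing left to split
        sum_low += t * hist[t]
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Synthetic page: dark ink around 20-30, light paper around 200-220.
page = [20] * 50 + [30] * 50 + [200] * 100 + [220] * 100
t = otsu_threshold(page)
binary = [255 if p > t else 0 for p in page]
print(t, binary.count(0), binary.count(255))
```

The chosen threshold lands between the ink cluster and the paper cluster, so no manual tuning per document is needed as long as the histogram is roughly bimodal.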
+### 2. Refined `prompts.py`
+**Changes:** Since preprocessing (deskewing/normalization) yields cleaner text, we can be stricter in the SOP. I have updated the System Prompt to explicitly map the "Golden Sample" logic to the output.
+```python
+def get_ocr_extraction_prompt(raw_text: str) -> str:
+    """
+    Returns a strict prompt with SOP and One-Shot example.
+    Refined to handle 'Line Items' specifically as preprocessing makes tables more readable.
+    """
+    return f"""<|im_start|>system
+You are a precise Invoice Data Extraction Agent.
+Your input is raw OCR text from a pre-processed invoice image.
+### STANDARD OPERATING PROCEDURE (SOP):
+1. **Header Extraction**: Identify the Vendor Name, Invoice Number, and Dates (Invoice & Due).
+2. **Table Parsing**: The OCR preserves inter-word spacing. Use this to identify the 'Line Items' table.
+3. **Normalization**:
+   - Dates must be YYYY-MM-DD.
+   - Amounts must be floats (no currency symbols).
+4. **Validation**: If 'Total Amount' is missing, calculate it from line items if possible.
+5. **Output Format**: Return ONLY valid JSON. No Markdown block markers (```json).
+### ONE-SHOT EXAMPLE (City of Auburn Invoice):
+**Input OCR**:
+"CITY OF AUBURN... 076248-000... Due: 01/07/25...
+Water Total $649.69... Sewer Total $1,333.45... Total New Charges $2,363.39"
+**Correct JSON**:
+{{
+    "invoice_number": "076248-000",
+    "vendor_name": "City of Auburn",
+    "invoice_date": "2024-12-18",
+    "due_date": "2025-01-07",
+    "total_amount": 2363.39,
+    "line_items": [
+        {{"description": "Water Total", "quantity": 1, "rate": 649.69, "amount": 649.69}},
+        {{"description": "Sewer Total", "quantity": 1, "rate": 1333.45, "amount": 1333.45}}
+    ]
+}}
+<|im_end|>
+<|im_start|>user
+### TARGET INVOICE OCR DATA:
+{raw_text[:4000]}
+Return the JSON:
+<|im_end|>
+<|im_start|>assistant
+{{
+""" # Pre-fill brace to force Qwen into JSON mode
+```
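Because the prompt ends with a pre-filled `{`, the model's completion normally arrives without its opening brace. The sketch below shows one way the caller might rebuild and parse that output, and then apply SOP step 4 (derive a missing total from line items). The helper names `parse_prefilled_json` and `fill_missing_total` are hypothetical and do not appear in the commit; the completion string is made up for the example.

```python
import json


def parse_prefilled_json(completion: str) -> dict:
    """Rebuild and parse JSON from a completion whose opening brace was pre-filled."""
    text = completion.replace("<|im_end|>", "").strip()
    if not text.startswith("{"):
        # The brace lived in the prompt, so put it back before parsing.
        text = "{" + text
    end = text.rfind("}")
    if end == -1:
        raise ValueError("no closing brace in model output")
    data = json.loads(text[: end + 1])  # ignore anything after the last brace
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


def fill_missing_total(invoice: dict) -> dict:
    """If total_amount is absent, sum the numeric line-item amounts (SOP step 4)."""
    if invoice.get("total_amount") is None:
        amounts = [
            item["amount"]
            for item in invoice.get("line_items", [])
            if isinstance(item, dict) and isinstance(item.get("amount"), (int, float))
        ]
        if amounts:
            invoice["total_amount"] = round(sum(amounts), 2)
    return invoice


completion = (
    '\n    "invoice_number": "076248-000",\n'
    '    "line_items": [{"description": "Water Total", "amount": 649.69},\n'
    '                   {"description": "Sewer Total", "amount": 1333.45}]\n'
    "}<|im_end|>"
)
invoice = fill_missing_total(parse_prefilled_json(completion))
print(invoice["invoice_number"], invoice["total_amount"])
```

Doing the total fallback in code as well as in the prompt keeps the arithmetic deterministic even when the model skips that step.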
+### 3. `config.py` (Modular Configuration)
+**Changes:** Decouples the "Canonical Schema" (from `prompts.py`) from Zoho's API field names. This allows you to edit the field mapping without changing the AI logic.
+```python
+# config.py
+import os
+# --- Zoho API Configuration ---
+# DO NOT COMMIT REAL CREDENTIALS TO GIT
+CLIENT_ID = os.getenv("ZOHO_CLIENT_ID", "YOUR_CLIENT_ID")
+CLIENT_SECRET = os.getenv("ZOHO_CLIENT_SECRET", "YOUR_CLIENT_SECRET")
+REFRESH_TOKEN = os.getenv("ZOHO_REFRESH_TOKEN", "YOUR_REFRESH_TOKEN")
+API_BASE = "https://www.zohoapis.in/crm/v2"
+# --- Schema Mapper ---
+# Maps LLM 'Canonical' keys -> Zoho CRM/Books API keys
+# If you switch CRM, you only change the right-hand side.
+ZOHO_INVOICE_MAP = {
+    "invoice_number": "invoice_number",  # Canonical : Zoho
+    "vendor_name": "customer_name",      # Note: Requires customer ID lookup in logic
+    "invoice_date": "date",
+    "due_date": "due_date",
+    "total_amount": "total",
+    "line_items": "line_items"
+}
+ZOHO_LINE_ITEM_MAP = {
+    "description": "name",
+    "quantity": "quantity",
+    "rate": "rate",
+    "amount": "item_total"
+}
+```
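Since `config.py` falls back to placeholder strings such as `YOUR_CLIENT_ID`, a missing environment variable would only surface later as a failed API call. A small fail-fast check is one option. The sketch below assumes a hypothetical helper named `require_env` and treats any value starting with `YOUR_` as unset; neither is part of the commit.

```python
import os

PLACEHOLDER_PREFIX = "YOUR_"  # matches the fallback values used in config.py


def require_env(name: str) -> str:
    """Return the environment variable's value, or raise if it is missing or a placeholder."""
    value = os.getenv(name, "")
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        raise RuntimeError(f"Environment variable {name} is not configured")
    return value


# Possible usage at startup (hypothetical):
# CLIENT_ID = require_env("ZOHO_CLIENT_ID")
```

Raising at import time makes a misconfigured deployment fail immediately instead of partway through processing an invoice.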
+### 4. `app.py` (Pipeline Integration)
+**Changes:** Integrates the new `ocr_preprocessing_engine`, applies the mapping from `config.py`, and maintains the FastMCP structure.
+```python
+from mcp.server.fastmcp import FastMCP
+import logging
+import json
+import os
+from ocr_preprocessing_engine import extract_text_with_preprocessing
+from prompts import get_ocr_extraction_prompt
+from config import ZOHO_INVOICE_MAP, ZOHO_LINE_ITEM_MAP
+# Initialize FastMCP
+mcp = FastMCP("ZohoInvoiceAgent")
+logger = logging.getLogger("mcp_server")
+def map_canonical_to_zoho(canonical_data: dict) -> dict:
+    """
+    Transforms generic LLM JSON into Zoho-ready JSON using config maps.
+    """
+    zoho_payload = {}
+    # 1. Map Top-Level Fields
+    for llm_key, zoho_key in ZOHO_INVOICE_MAP.items():
+        if llm_key in canonical_data and llm_key != "line_items":
+            zoho_payload[zoho_key] = canonical_data[llm_key]
+    # 2. Map Line Items
+    if "line_items" in canonical_data and isinstance(canonical_data["line_items"], list):
+        zoho_items = []
+        for item in canonical_data["line_items"]:
+            new_item = {}
+            for l_key, z_key in ZOHO_LINE_ITEM_MAP.items():
+                if l_key in item:
+                    new_item[z_key] = item[l_key]
+            # Zoho API often requires quantity default to 1 if missing
+            if "quantity" not in new_item:
+                new_item["quantity"] = 1
+            zoho_items.append(new_item)
+        zoho_payload["line_items"] = zoho_items
+    return zoho_payload
+@mcp.tool()
+def process_invoice_document(file_path: str) -> dict:
+    """
+    MCP Tool: Takes an invoice PDF/Image, runs strict preprocessing (Normalize->Denoise->Threshold->Deskew),
+    extracts data via Qwen 2.5, and maps it to Zoho API format.
+    """
+    if not os.path.exists(file_path):
+        return {"error": "File not found"}
+    # Step 1: Enhanced OCR Preprocessing
+    # This step is critical to fix rotation and contrast issues before Tesseract runs.
+    raw_text = extract_text_with_preprocessing(file_path)
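+    # extract_text_with_preprocessing reports failures as an "Error processing file: ..." string
+    if raw_text.startswith("Error processing file:") or len(raw_text) < 50:
+        return {"error": "OCR failed. Image may be too blurry or blank."}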
+    # Step 2: LLM Extraction (Qwen 2.5)
+    prompt = get_ocr_extraction_prompt(raw_text)
+    # Mocking local_llm_generate for this snippet - ensure this connects to your Qwen pipeline
+    # Ensure do_sample=False (Greedy Decoding) to reduce erratic json
+    # response = local_llm_generate(prompt, max_tokens=500, do_sample=False)
+    # --- SIMULATED RESPONSE FOR DEMO ---
+    # In production, replace this with actual model generation
+    logger.info("Sending text to LLM...")
+    # -----------------------------------
+    try:
+        # Assuming response["text"] contains the JSON
+        # Here we pretend the LLM returned the canonical JSON structure
+        # canonical_data = json.loads("{" + response["text"])
+        # For demonstration, let's assume valid extraction:
+        canonical_data = {
+            "invoice_number": "INV-001",
+            "total_amount": 100.00,
+            "line_items": [{"description": "Service", "rate": 100.00}]
+        }
+        # Step 3: Map to Zoho Structure
+        zoho_ready_data = map_canonical_to_zoho(canonical_data)
+        return {
+            "status": "success",
+            "source_file": os.path.basename(file_path),
+            "canonical_data": canonical_data, # Useful for debugging/user verification
+            "zoho_payload": zoho_ready_data # Ready for the create_invoice tool
+        }
+    except Exception as e:
+        return {"error": f"Processing failed: {str(e)}"}
+if __name__ == "__main__":
+    mcp.run()
+```
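As a worked example of the mapping step, the sketch below runs the demo `canonical_data` from `process_invoice_document` through a condensed restatement of `map_canonical_to_zoho`. The names `INVOICE_MAP`, `LINE_ITEM_MAP`, and `map_to_zoho` are local stand-ins for illustration; the field pairs are copied from `config.py` above.

```python
# Top-level canonical keys -> Zoho keys (line_items is handled separately below)
INVOICE_MAP = {
    "invoice_number": "invoice_number",
    "vendor_name": "customer_name",
    "invoice_date": "date",
    "due_date": "due_date",
    "total_amount": "total",
}
LINE_ITEM_MAP = {
    "description": "name",
    "quantity": "quantity",
    "rate": "rate",
    "amount": "item_total",
}


def map_to_zoho(canonical: dict) -> dict:
    """Rename canonical keys to Zoho keys; default each line item's quantity to 1."""
    payload = {INVOICE_MAP[k]: v for k, v in canonical.items() if k in INVOICE_MAP}
    if isinstance(canonical.get("line_items"), list):
        items = []
        for item in canonical["line_items"]:
            mapped = {LINE_ITEM_MAP[k]: v for k, v in item.items() if k in LINE_ITEM_MAP}
            mapped.setdefault("quantity", 1)
            items.append(mapped)
        payload["line_items"] = items
    return payload


canonical_data = {
    "invoice_number": "INV-001",
    "total_amount": 100.00,
    "line_items": [{"description": "Service", "rate": 100.00}],
}
print(map_to_zoho(canonical_data))
# {'invoice_number': 'INV-001', 'total': 100.0, 'line_items': [{'name': 'Service', 'rate': 100.0, 'quantity': 1}]}
```

Note that the line item carries no `item_total`, because the demo data has no `amount` field; only keys present in the canonical input are mapped.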