SmartContractMigrator


pavansuresh committed on Jul 3, 2025

Commit a50fdc7 (verified)

1 Parent(s): 0c1f04f

Create ocr_utils.py

Files changed (1)

ocr_utils.py ADDED

+import tempfile
+
+import pytesseract
+from pdf2image import convert_from_path
+
+
+def extract_text_from_pdf(pdf_path):
+    """
+    Extracts text from a scanned PDF using OCR (Tesseract).
+    Converts each PDF page to an image and runs pytesseract on it.
+    Returns the page texts joined by newlines.
+    """
+    # Page images are written to tempdir, so run OCR before the
+    # directory is cleaned up on leaving the with block.
+    with tempfile.TemporaryDirectory() as tempdir:
+        images = convert_from_path(pdf_path, dpi=300, output_folder=tempdir)
+        # Run Tesseract on each page image, in page order.
+        all_text = [pytesseract.image_to_string(img) for img in images]
+        return "\n".join(all_text)
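
The function above depends on Poppler and Tesseract being installed, which makes it awkward to exercise in isolation. A minimal sketch of one possible variant follows, assuming the converter and OCR step are passed in as callables. The name `extract_text_from_pages` and the `convert`, `ocr`, and `dpi` parameters are hypothetical and not part of this commit. The real `pdf2image` and `pytesseract` functions are used only when no replacement is supplied.

```python
import tempfile
from typing import Callable, Iterable, Optional


def extract_text_from_pages(
    pdf_path: str,
    convert: Optional[Callable[..., Iterable]] = None,
    ocr: Optional[Callable[[object], str]] = None,
    dpi: int = 300,
) -> str:
    """Hypothetical variant of extract_text_from_pdf with injectable backends."""
    # Import the real backends lazily so the function can be called with
    # stand-ins on machines without Poppler or Tesseract.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr

    with tempfile.TemporaryDirectory() as tempdir:
        # Same call shape as in ocr_utils.py: page images go to tempdir.
        images = convert(pdf_path, dpi=dpi, output_folder=tempdir)
        # OCR each page inside the with block, then join in page order.
        return "\n".join(ocr(img) for img in images)
```

With stand-in callables, the page-joining logic can be checked without any OCR toolchain; calling it with only `pdf_path` should behave like the committed function, assuming both libraries are available.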