import gradio as gr
import torch
import os
import re
import tempfile
from PIL import Image
from docx import Document
from bs4 import BeautifulSoup
from threading import Thread

# --- Transformers Import ---
try:
    from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor, TextIteratorStreamer
except ImportError as e:
    raise ImportError(
        "Transformers library not found. Please install "
        "git+https://github.com/huggingface/transformers.git"
    ) from e

# --- Global Model Loading ---
print("Loading AI Model (2.1B Parameters)... This may take a minute...")
try:
    # OPTIMIZATION: Check for CUDA but don't force it if we are on a CPU tier to avoid errors
    if torch.cuda.is_available():
        device = "cuda"
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        print(f"Running on GPU: {torch.cuda.get_device_name(0)}")
    else:
        device = "cpu"
        dtype = torch.float32  # CPUs handle float32 best
        print("Running on CPU mode")

    model_id = "lightonai/LightOnOCR-2-1B"
    processor = LightOnOcrProcessor.from_pretrained(model_id)

    # Load model
    model = LightOnOcrForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=dtype,
        attn_implementation="sdpa",  # Use SDPA for both CPU and GPU (faster on PyTorch 2.0+)
        low_cpu_mem_usage=True
    ).to(device)
    model.eval()
    print("Model Loaded Successfully!")
except Exception as e:
    print(f"Failed to load model: {e}")
    model = None
    processor = None

# --- Helper Functions ---
def resize_for_ocr(image, max_dim=768):
    """
    Resize image to be faster.
    Lowered max_dim from 1280->896->768 for CPU deployment to ensure reasonable speed.
    """
    if image is None:
        return None
    w, h = image.size
    if max(w, h) > max_dim:
        scale = max_dim / max(w, h)
        new_w = int(w * scale)
        new_h = int(h * scale)
        return image.resize((new_w, new_h), Image.Resampling.LANCZOS)
    return image


def clean_latex_for_word(text):
    """Clean simple LaTeX commands for better readability in Word."""
    text = re.sub(r'\\begin\{array\}\{.*?\}', '', text)
    text = text.replace(r'\end{array}', '')
    text = re.sub(r'\\text\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\textbf\{([^}]*)\}', r'\1', text)
    text = re.sub(r'\\textit\{([^}]*)\}', r'\1', text)
    text = text.replace(r'\\', '\n')
    text = text.replace(r'\rightarrow', '→').replace(r'\leftarrow', '←')
    text = text.replace(r'\leftrightarrow', '↔').replace(r'\Rightarrow', '⇒')
    text = text.replace(r'\downarrow', '↓').replace(r'\uparrow', '↑')
    text = text.replace(r'\ldots', '...').replace(r'\cdots', '...')
    text = text.replace(r'\times', '×').replace(r'\approx', '≈')
    text = text.replace(r'\le', '≤').replace(r'\ge', '≥')
    return text


def format_latex_for_display(text):
    """
    Auto-detects lines containing LaTeX (math/chemical equations) and wraps them
    in $$ so Gradio/Markdown renders them correctly.
    """
    lines = text.split('\n')
    formatted = []
    # Regex to detect lines that look like chemical equations (have arrows, subscripts, superscripts)
    # Checks for: \xrightarrow, \rightarrow, _{num}, ^{num}, \frac, etc.
    chem_pattern = re.compile(r"(\\xrightarrow|\\rightarrow|\\frac|\^\{|_\{|_[0-9]|[A-Z][a-z]?_\d)")
    for line in lines:
        # If line contains LaTeX indicators and isn't already wrapped in $$
        if chem_pattern.search(line) and "$$" not in line:
            # Avoid wrapping lines that look like plain text but just have one subscript
            # But for chemistry usually even simple formulas look better in math mode
            formatted.append(f"$${line}$$")
        else:
            formatted.append(line)
    return "\n".join(formatted)


def process_markdown_segment(text, doc):
    """Process standard markdown text lines."""
    lines = text.split('\n')
    for line in lines:
        line = line.strip()
        if not line:
            continue
        line = clean_latex_for_word(line)
        if line.startswith('#'):
            parts = line.split(' ', 1)
            if len(parts) > 1:
                hashes, content = parts
                if all(c == '#' for c in hashes):
                    doc.add_heading(content, level=min(len(hashes), 9))
                    continue
        if '$' in line:
            p = doc.add_paragraph()
            parts = line.split('$')
            for i, part in enumerate(parts):
                if i % 2 == 1:
                    run = p.add_run(part)
                    run.italic = True
                    run.font.name = 'Cambria Math'
                else:
                    p.add_run(part)
            continue
        if line.startswith('- ') or line.startswith('* '):
            doc.add_paragraph(line[2:].strip(), style='List Bullet')
        else:
            doc.add_paragraph(line)


def process_html_table(html_str, doc):
    """Parse HTML table and add to Docx."""
    try:
        soup = BeautifulSoup(html_str, 'html.parser')
        rows = soup.find_all('tr')
        if not rows:
            return
        max_cols = max([len(row.find_all(['td', 'th'])) for row in rows]) if rows else 0
        if max_cols == 0:
            return
        table = doc.add_table(rows=len(rows), cols=max_cols)
        table.style = 'Table Grid'
        for i, row in enumerate(rows):
            cols = row.find_all(['td', 'th'])
            for j, col in enumerate(cols):
                if j < max_cols:
                    table.cell(i, j).text = col.get_text(strip=True)
    except Exception as e:
        doc.add_paragraph(f"[Error parsing table: {e}]")


def markdown_to_docx(text):
    """Convert extracted text to Docx object."""
    doc = Document()
    table_pattern = re.compile(r'(<table.*?</table>)', re.IGNORECASE | re.DOTALL)
    parts = table_pattern.split(text)
    for part in parts:
        if not part.strip():
            continue
        if part.strip().lower().startswith('<table'):
            process_html_table(part, doc)
        else:
            process_markdown_segment(part, doc)
    return doc


def stream_ocr(image):
    """Run OCR on the image, streaming markdown text and finally a .docx file."""
    if image is None:
        yield "Please upload an image first.", None
        return
    if model is None or processor is None:
        yield "Model failed to load. Check the Space logs.", None
        return

    image = resize_for_ocr(image)

    # Standard transformers chat-template flow for image-to-text models;
    # adjust the message format to the model card if needed.
    messages = [{"role": "user", "content": [{"type": "image"}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

    streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=2048, do_sample=False)

    # Generate in a background thread so we can yield partial text as it arrives
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    full_text = ""
    for chunk in streamer:
        full_text += chunk
        yield format_latex_for_display(full_text), None
    thread.join()

    # Build the Word export once streaming is complete
    doc = markdown_to_docx(full_text)
    with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp:
        docx_path = tmp.name
    doc.save(docx_path)
    yield format_latex_for_display(full_text), docx_path


# --- Example Gallery Setup ---
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
examples_dir = os.path.join(BASE_DIR, "examples")
example_images = []
if os.path.isdir(examples_dir):
    example_images = [
        os.path.join(examples_dir, f)
        for f in sorted(os.listdir(examples_dir))
        if f.lower().endswith(('.png', '.jpg', '.jpeg', '.webp'))
    ]

# --- UI ---
custom_css = """
.scrollable-md {height: 500px; overflow-y: auto;}
"""

with gr.Blocks(css=custom_css, title="Ultra OCR") as demo:
    gr.HTML(
        """
        <div style="text-align: center;">
            <h1>🤖 Ultra OCR</h1>
            <p>Crafted with ❤️ by The Best Team</p>
        </div>
        """
    )
    with gr.Row(equal_height=False, variant="panel"):
        with gr.Column(scale=4):
            input_img = gr.Image(
                type="pil",
                label="Document Source",
                height=500,
                sources=['upload', 'clipboard'],
                format="png",
                show_label=False  # Hide label to prevent text overlay on image
            )
            run_btn = gr.Button("⚡ Start Transcription", variant="primary", size="lg")
        with gr.Column(scale=5):
            with gr.Tabs():
                with gr.TabItem("📝 Live Text"):
                    output_text = gr.Markdown(
                        label="Real-time Extraction",
                        elem_classes=["scrollable-md"]
                    )
                with gr.TabItem("💾 Export"):
                    gr.Markdown("### Download Results")
                    output_file = gr.File(label="Download Word (.docx)", type="filepath")

    # Example Gallery
    if example_images:
        gr.HTML("<hr>")
        gr.Markdown("### 📂 Sample Documents")
        gr.Examples(
            examples=example_images,
            inputs=input_img,
            label="Click a sample to test",
            examples_per_page=5
        )

    # Interactions
    run_btn.click(
        fn=stream_ocr,
        inputs=[input_img],
        outputs=[output_text, output_file],
        concurrency_limit=5
    )

if __name__ == "__main__":
    # Removed ssr_mode=False to fix gallery previews.
    # Using absolute paths with allowed_paths matches the working app.py config.
    demo.launch(
        allowed_paths=[os.path.dirname(os.path.abspath(__file__))]
    )