Spaces: DocUA/Local_OCR_Demo

DocUA committed on Jan 28

Commit b752d16 (0 parents)

Initial commit: DeepSeek-OCR-2 & MedGemma-1.5 multimodal analysis app with ZeroGPU support

Files changed (16)

.gitignore +32 -0
OCR_ANALYSIS_REPORT.md +57 -0
README.md +60 -0
README_HF.md +31 -0
app.py +264 -0
app_hf.py +260 -0
compare_models.py +105 -0
convert_docs.py +31 -0
convert_full_pdf.py +32 -0
generate_test_image.py +21 -0
ocr_full_pdf12.py +89 -0
requirements.txt +16 -0
test_inference.py +66 -0
test_medgemma.py +64 -0
test_minimal.py +31 -0
test_real_docs.py +82 -0

.gitignore ADDED

	@@ -0,0 +1,32 @@

+# Virtual environment
+venv/
+.venv/
+env/
+# Data and Results
+doc_for_testing/
+doc_images/
+doc_images_full/
+ocr_results/
+ocr_results_package/
+ocr_results_pdf12/
+outputs/
+# Temporary and generated files
+*.zip
+*.jpg
+*.png
+*.pdf
+# Exception: keep sample_test.png, which the app and README examples may need
+!sample_test.png
+temp_comp.png
+ocr_result_*.txt
+# Python cache
+__pycache__/
+*.py[cod]
+*$py.class
+# IDEs
+.vscode/
+.idea/
+.DS_Store

OCR_ANALYSIS_REPORT.md ADDED

	@@ -0,0 +1,57 @@

+# DeepSeek-OCR-2 Performance and Accuracy Analysis
+**Date:** January 28, 2026
+**Test file:** `doc_for_testing/pdf12_un.pdf` (13 pages)
+**Environment:** Apple M3 Max (CPU inference, float32)
+---
+## 1. Accuracy Analysis
+**Overall score:** 8/10
+The model shows a strong grasp of document context and structure, but it has specific problems typical of large language models (LLMs).
+### ✅ Strengths
+*   **Deep contextual understanding:** The model reliably distinguishes document sections ("Impression", "Plan", "Vitals"). The Markdown output is clean and ready to use.
+*   **Medical terminology:** Domain-specific terms are recognized almost flawlessly (e.g., *Gastroesophageal reflux disease*, *Cholecystectomy*, *Tissue Transglutaminase*).
+*   **Tables:** The model correctly converts visual tables into Markdown tables and preserves the logical relationships in the data.
+*   **Noise robustness:** It handles varied fonts and formatting well.
+### ⚠️ Critical Problems (Weaknesses)
+*   **Hallucinated proper names:** This is the most serious problem. The model tends to "make up" brand or organization names when the text is blurry or the logo is complex.
+    *   *Atrium Health* $\rightarrow$ recognized as **"Arthur Health"**.
+    *   *Carolina Imaging Services* $\rightarrow$ recognized as **"Carlos Alings Ingegvers"**.
+*   **Minor recognition errors:**
+    *   *Post-menopausal* $\rightarrow$ **"Pilot-menopausal"**.
+    *   Duplicated answers in checklists (e.g., "No No" instead of "No").
+---
+## 2. Speed Analysis (Performance)
+**Overall score (CPU):** 6/10
+Speed was measured on the CPU because the current DeepSeek code has limited MPS (Metal Performance Shaders) support for certain MoE (Mixture of Experts) layers.
+*   **Average time per page:** ~19-20 seconds.
+    *   *Fastest:* ~7.4 s (pages with little text).
+    *   *Slowest:* ~29 s (dense pages).
+*   **Full run (13 pages):** ~4.5-5 minutes.
+**Speed conclusion:** On the CPU the model is suitable only for background batch processing. It is too slow for interactive (real-time) use.
+---
+## 3. Recommendations
+### To improve accuracy:
+1.  **Post-processing:** Add a validator dictionary for critical known entities. For example, automatically replace "Arthur Health" with "Atrium Health" based on a list of known clinics.
+2.  **Hybrid approach:** Use a classic OCR engine (for example, Tesseract or PaddleOCR) to extract exact names (the "raw text"), and use DeepSeek-OCR-2 for structuring and semantic understanding.
+### To improve speed:
+1.  **GPU inference:** Moving to an NVIDIA GPU (CUDA) is required for production. It should speed up processing roughly 10-20x (to ~1-2 seconds per page).
+2.  **Quantization:** Consider 4-bit or 8-bit quantization (GGUF/AWQ), provided accuracy does not degrade critically. This should speed things up significantly even on CPU/Mac.
+### Target use:
+DeepSeek-OCR-2 is well suited to **ETL processes** (Extract, Transform, Load) that turn unstructured PDFs and images into structured data (JSON/Markdown) for further analysis. It is less suited to tasks that require 100% character-level accuracy with no "creativity" (for example, reading codes or serial numbers).
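
The post-processing recommendation in the report (a validator dictionary for known entities) could look roughly like the sketch below. It is only an illustration: `KNOWN_ENTITIES`, `correct_entity`, and the 0.7 similarity cutoff are hypothetical choices, and fuzzy matching with the standard-library `difflib` module is one possible technique, not something the commit implements.

```python
import difflib

# Hypothetical list of trusted names; a real deployment would load its own.
KNOWN_ENTITIES = ["Atrium Health", "Carolina Imaging Services"]


def correct_entity(candidate, known=KNOWN_ENTITIES, cutoff=0.7):
    """Return the closest known entity, or the candidate unchanged if none is close."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        candidate.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else candidate
```

The cutoff trades missed corrections against false ones: a heavily garbled name may fall below it and pass through unchanged, which is why the report also suggests a hybrid approach with a classic OCR engine.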

README.md ADDED

	@@ -0,0 +1,60 @@

+# DeepSeek-OCR-2 & MedGemma-1.5 Multimodal Analysis
+This project analyzes medical and general documents with two modern multimodal models: **DeepSeek-OCR-2** and **MedGemma-1.5-4B-IT**.
+## 🚀 Key Features
+- **DeepSeek-OCR-2**: High-accuracy text recognition (OCR) built on a Mixture-of-Experts (MoE) architecture.
+- **MedGemma-1.5-4B-IT**: A multimodal model from Google specialized in medical images and text (Gemma 3 / PaliGemma architecture).
+- **Gradio web interface**: Upload images or PDFs, choose a model, and view the results.
+- **Model comparison**: A dedicated tool that analyzes a single page with both models.
+- **Mac optimization**: Patches for MPS (Metal Performance Shaders) support and compatibility fixes for newer `transformers` versions.
+## 📦 Project Contents
+- `app.py`: Main application with the Gradio interface.
+- `compare_models.py`: Script that compares DeepSeek and MedGemma output.
+- `test_medgemma.py`: Test script that checks that MedGemma works.
+- `outputs/`: Directory where analysis results are saved.
+- `venv/`: Python 3.11.9 virtual environment.
+## 🛠 Setup
+### 1. Prepare the environment
+```bash
+# Activate the virtual environment
+source venv/bin/activate
+# Install the required libraries (or update them if needed)
+pip install -r requirements.txt
+```
+### 2. Access to MedGemma
+To use `google/medgemma-1.5-4b-it` you need to:
+1. Have a Hugging Face account.
+2. Accept the model's terms of use on the [model page](https://huggingface.co/google/medgemma-1.5-4b-it).
+3. Log in locally: `huggingface-cli login`.
+## 🖥 How to Run
+### Launch the web interface
+```bash
+python app.py
+```
+After launch, open the link in your browser (usually `http://127.0.0.1:7860`).
+### Compare results
+```bash
+python compare_models.py
+```
+The result is saved to `model_comparison.md`.
+## 🍎 Notes for macOS (M1/M2/M3)
+The project applies automatic fixes (monkeypatching) for:
+1. **Transformers 5.0 compatibility**: Fixes import errors for `LlamaFlashAttention2` and `DynamicCache`.
+2. **MPS acceleration**: Uses the Mac GPU automatically where possible (float16).
+3. **MoE on CPU**: Because DeepSeek MoE has limited MPS support, some of its parts automatically fall back to the CPU for stability.
+---
+*This project was built to test and demonstrate what modern LLMs can do for medical document recognition.*

README_HF.md ADDED

	@@ -0,0 +1,31 @@

+---
+title: Local OCR Demo
+emoji: 🔍
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+sdk_version: 4.44.1
+app_file: app_hf.py
+pinned: false
+license: apache-2.0
+---
+# 🔍 OCR & Medical Document Analysis
+A comparison of DeepSeek-OCR-2 and MedGemma-1.5-4B (Hugging Face ZeroGPU Edition).
+## 🚀 Key Features
+- **DeepSeek-OCR-2**: OCR built on a Mixture-of-Experts (MoE) architecture.
+- **MedGemma-1.5-4B-IT**: Google's medical multimodal model.
+- **ZeroGPU Support**: Runs on GPUs in the Hugging Face cloud.
+## 🛠 Setup on Hugging Face Spaces
+1. Create a new Space with the **Gradio** SDK.
+2. Select **ZeroGPU** as the hardware type.
+3. Add `HF_TOKEN` under **Settings -> Variables and secrets** if you plan to use MedGemma.
+4. Copy the contents of `app_hf.py` (as `app.py`) and `requirements.txt`.
+## 📦 Dependencies
+All required libraries are listed in `requirements.txt`.
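
The `HF_TOKEN` step in README_HF.md matters because MedGemma is a gated model. A minimal sketch of an early check that an app could run before trying to load it; `GATED_MODELS` and `missing_token_message` are hypothetical names, and the assumption that only MedGemma is gated comes from the README, not from any API:

```python
import os

# Assumption: of the two models used here, only MedGemma requires a token.
GATED_MODELS = {"google/medgemma-1.5-4b-it"}


def missing_token_message(model_name, env=None):
    """Return a warning if a gated model is requested without HF_TOKEN, else None."""
    env = os.environ if env is None else env
    if model_name in GATED_MODELS and not env.get("HF_TOKEN"):
        return f"{model_name} is gated: add HF_TOKEN to the Space secrets first."
    return None
```

Returning a message instead of raising lets a Gradio handler show it in the output box, in the same way the app already reports load errors.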

app.py ADDED

	@@ -0,0 +1,264 @@

+import gradio as gr
+from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
+import torch
+import os
+from PIL import Image
+import tempfile
+import datetime
+import fitz  # PyMuPDF
+import io
+import gc
+# --- Configuration ---
+DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
+MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'
+# --- Device Setup ---
+if torch.backends.mps.is_available():
+    print("Using MPS device")
+    device = "mps"
+    # Patch for DeepSeek custom code which uses .cuda()
+    torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("mps")
+    torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("mps")
+    dtype = torch.float16
+else:
+    device = "cpu"
+    dtype = torch.float32
+class ModelManager:
+    def __init__(self):
+        self.current_model_name = None
+        self.model = None
+        self.processor = None
+        self.tokenizer = None
+    def unload_current_model(self):
+        if self.model is not None:
+            print(f"Unloading {self.current_model_name}...")
+            del self.model
+            del self.processor
+            del self.tokenizer
+            self.model = None
+            self.processor = None
+            self.tokenizer = None
+            self.current_model_name = None
+            if torch.backends.mps.is_available():
+                torch.mps.empty_cache()
+            gc.collect()
+    def load_model(self, model_name):
+        if self.current_model_name == model_name:
+            return self.model, self.processor or self.tokenizer
+        self.unload_current_model()
+        print(f"Loading {model_name}...")
+        if model_name == DEEPSEEK_MODEL:
+            self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+            self.model = AutoModel.from_pretrained(
+                model_name,
+                trust_remote_code=True,
+                use_safetensors=True
+            )
+            self.model = self.model.to(device=device, dtype=dtype)
+            self.model.eval()
+            self.current_model_name = model_name
+            return self.model, self.tokenizer
+        elif model_name == MEDGEMMA_MODEL:
+            self.processor = AutoProcessor.from_pretrained(model_name)
+            self.model = AutoModelForImageTextToText.from_pretrained(
+                model_name,
+                trust_remote_code=True,
+                torch_dtype=dtype if device == "mps" else torch.float32,
+                device_map="auto" if device != "mps" else None
+            )
+            if device == "mps":
+                self.model = self.model.to("mps")
+            self.model.eval()
+            self.current_model_name = model_name
+            return self.model, self.processor
+manager = ModelManager()
+def pdf_to_images(pdf_path):
+    doc = fitz.open(pdf_path)
+    images = []
+    for page in doc:
+        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
+        img_data = pix.tobytes("png")
+        img = Image.open(io.BytesIO(img_data))
+        images.append(img)
+    doc.close()
+    return images
+def run_ocr(input_image, input_file, model_choice, custom_prompt):
+    images_to_process = []
+    if input_file is not None:
+        if input_file.name.lower().endswith(".pdf"):
+            try:
+                images_to_process = pdf_to_images(input_file.name)
+            except Exception as e:
+                return f"Помилка читання PDF: {str(e)}"
+        else:
+            try:
+                images_to_process = [Image.open(input_file.name)]
+            except Exception as e:
+                return f"Помилка завантаження файлу: {str(e)}"
+    elif input_image is not None:
+        images_to_process = [input_image]
+    else:
+        return "Будь ласка, завантажте зображення або PDF файл."
+    model, processor_or_tokenizer = manager.load_model(model_choice)
+    output_dir = 'outputs'
+    os.makedirs(output_dir, exist_ok=True)
+    all_results = []
+    for i, img in enumerate(images_to_process):
+        img = img.convert("RGB")
+        try:
+            print(f"Processing page/image {i+1} with {model_choice}...")
+            if model_choice == DEEPSEEK_MODEL:
+                with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+                    img.save(tmp.name)
+                    tmp_path = tmp.name
+                try:
+                    with torch.no_grad():
+                        res = model.infer(
+                            processor_or_tokenizer,
+                            prompt=custom_prompt if custom_prompt else "<image>\nFree OCR. ",
+                            image_file=tmp_path,
+                            output_path=output_dir,
+                            base_size=1024,
+                            image_size=768,
+                            crop_mode=True,
+                            eval_mode=True
+                        )
+                    all_results.append(f"--- Page/Image {i+1} ---\n{res}")
+                finally:
+                    if os.path.exists(tmp_path):
+                        os.remove(tmp_path)
+            elif model_choice == MEDGEMMA_MODEL:
+                prompt_text = custom_prompt if custom_prompt else "extract all text from image"
+                messages = [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "image", "image": img},
+                            {"type": "text", "text": prompt_text}
+                        ]
+                    }
+                ]
+                inputs = processor_or_tokenizer.apply_chat_template(
+                    messages,
+                    add_generation_prompt=True,
+                    tokenize=True,
+                    return_dict=True,
+                    return_tensors="pt"
+                ).to(model.device)
+                with torch.no_grad():
+                    output = model.generate(**inputs, max_new_tokens=4096)
+                input_len = inputs["input_ids"].shape[-1]
+                res = processor_or_tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
+                all_results.append(f"--- Page/Image {i+1} ---\n{res}")
+        except Exception as e:
+            all_results.append(f"--- Page/Image {i+1} ---\nПомилка: {str(e)}")
+    return "\n\n".join(all_results)
+def save_result_to_file(text):
+    if not text or text.startswith("Будь ласка") or text.startswith("Помилка"):
+        return None
+    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+    filename = f"ocr_result_{timestamp}.txt"
+    os.makedirs("outputs", exist_ok=True)
+    filepath = os.path.abspath(os.path.join("outputs", filename))
+    with open(filepath, "w", encoding="utf-8") as f:
+        f.write(text)
+    return filepath
+custom_css = """
+.header { text-align: center; margin-bottom: 30px; }
+.header h1 { font-size: 2.5rem; }
+.footer { text-align: center; margin-top: 50px; font-size: 0.9rem; color: #718096; }
+"""
+with gr.Blocks(title="OCR Comparison: DeepSeek vs MedGemma", css=custom_css) as demo:
+    with gr.Column():
+        gr.Markdown("# 🔍 OCR & Medical Document Analysis", elem_classes="header")
+        gr.Markdown("Порівняння DeepSeek-OCR-2 та MedGemma-1.5-4B", elem_classes="header")
+        with gr.Row():
+            with gr.Column(scale=1):
+                with gr.Tab("Зображення"):
+                    input_img = gr.Image(type="pil", label="Перетягніть зображення")
+                with gr.Tab("PDF / Файли"):
+                    input_file = gr.File(label="Завантажте PDF або інший файл")
+                model_selector = gr.Dropdown(
+                    choices=[DEEPSEEK_MODEL, MEDGEMMA_MODEL],
+                    value=DEEPSEEK_MODEL,
+                    label="Оберіть модель"
+                )
+                with gr.Accordion("Налаштування", open=False):
+                    prompt_input = gr.Textbox(
+                        value="",
+                        label="Користувацький промпт (залиште порожнім для дефолтного)",
+                        placeholder="Наприклад: Extract all text from image"
+                    )
+                with gr.Row():
+                    clear_btn = gr.Button("Очистити", variant="secondary")
+                    ocr_btn = gr.Button("Запустити аналіз", variant="primary")
+            with gr.Column(scale=1):
+                output_text = gr.Textbox(
+                    label="Результат",
+                    lines=20
+                )
+                with gr.Row():
+                    save_btn = gr.Button("Зберегти у файл 💾")
+                    download_file = gr.File(label="Завантажити результат")
+        gr.Markdown("---")
+        gr.Examples(
+            examples=[["sample_test.png", None, DEEPSEEK_MODEL, ""]],
+            inputs=[input_img, input_file, model_selector, prompt_input]
+        )
+    # Event handlers
+    ocr_btn.click(
+        fn=run_ocr,
+        inputs=[input_img, input_file, model_selector, prompt_input],
+        outputs=output_text
+    )
+    save_btn.click(
+        fn=save_result_to_file,
+        inputs=output_text,
+        outputs=download_file
+    )
+    def clear_all():
+        return None, None, ""
+    clear_btn.click(
+        fn=clear_all,
+        inputs=None,
+        outputs=[input_img, input_file, output_text]
+    )
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", share=False)
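
In `app.py`, each page's output or error message is appended under a `--- Page/Image N ---` header, so a failure on a single page never appears at the start of the combined text that `save_result_to_file` inspects. A minimal sketch of how failed pages could be listed, assuming that header format and the `Помилка` error prefix from the code; `failed_pages` is a hypothetical helper, not part of the commit:

```python
import re

# Header written by run_ocr() before each page's result.
PAGE_HEADER = re.compile(r"^--- Page/Image (\d+) ---$", re.MULTILINE)
ERROR_PREFIX = "Помилка"


def failed_pages(text):
    """Return the page numbers whose section starts with the error prefix."""
    # split() with one capture group yields [preamble, num, body, num, body, ...]
    parts = PAGE_HEADER.split(text)
    failed = []
    for num, body in zip(parts[1::2], parts[2::2]):
        if body.lstrip().startswith(ERROR_PREFIX):
            failed.append(int(num))
    return failed
```

This relies on the OCR text itself never containing a line that looks like the page header, so it is a heuristic rather than a robust parser.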

app_hf.py ADDED

	@@ -0,0 +1,260 @@

+import gradio as gr
+from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
+import torch
+import os
+from PIL import Image
+import tempfile
+import datetime
+import fitz  # PyMuPDF
+import io
+import gc
+# Try to import spaces, if not available (local run), create a dummy decorator
+try:
+    import spaces
+except ImportError:
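+    class spaces:
+        @staticmethod
+        def GPU(func=None, **kwargs):
+            # Support both @spaces.GPU and @spaces.GPU(duration=...)
+            if callable(func):
+                return func
+            return lambda f: f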
+# --- Configuration ---
+DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
+MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'
+# --- Device Setup ---
+# For HF Spaces with ZeroGPU, we'll use cuda if available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+class ModelManager:
+    def __init__(self):
+        self.models = {}
+        self.processors = {}
+    def get_model(self, model_name):
+        if model_name not in self.models:
+            print(f"Loading {model_name} to CPU...")
+            if model_name == DEEPSEEK_MODEL:
+                tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+                model = AutoModel.from_pretrained(
+                    model_name,
+                    trust_remote_code=True,
+                    use_safetensors=True,
+                    torch_dtype=dtype
+                )
+                model.eval()
+                self.models[model_name] = model
+                self.processors[model_name] = tokenizer
+            elif model_name == MEDGEMMA_MODEL:
+                processor = AutoProcessor.from_pretrained(model_name)
+                model = AutoModelForImageTextToText.from_pretrained(
+                    model_name,
+                    trust_remote_code=True,
+                    torch_dtype=dtype
+                )
+                model.eval()
+                self.models[model_name] = model
+                self.processors[model_name] = processor
+        return self.models[model_name], self.processors[model_name]
+manager = ModelManager()
+def pdf_to_images(pdf_path):
+    doc = fitz.open(pdf_path)
+    images = []
+    for page in doc:
+        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
+        img_data = pix.tobytes("png")
+        img = Image.open(io.BytesIO(img_data))
+        images.append(img)
+    doc.close()
+    return images
+@spaces.GPU(duration=120)
+def run_ocr(input_image, input_file, model_choice, custom_prompt):
+    images_to_process = []
+    if input_file is not None:
+        if input_file.name.lower().endswith(".pdf"):
+            try:
+                images_to_process = pdf_to_images(input_file.name)
+            except Exception as e:
+                return f"Помилка читання PDF: {str(e)}"
+        else:
+            try:
+                images_to_process = [Image.open(input_file.name)]
+            except Exception as e:
+                return f"Помилка завантаження файлу: {str(e)}"
+    elif input_image is not None:
+        images_to_process = [input_image]
+    else:
+        return "Будь ласка, завантажте зображення або PDF файл."
+    try:
+        model, processor_or_tokenizer = manager.get_model(model_choice)
+        # Move to GPU only inside the decorated function
+        print(f"Moving {model_choice} to GPU...")
+        model.to("cuda")
+    except Exception as e:
+        return f"Помилка завантаження чи переміщення моделі: {str(e)}\nЯкщо це MedGemma, переконайтеся, що ви надали HF_TOKEN."
+    output_dir = 'outputs'
+    os.makedirs(output_dir, exist_ok=True)
+    all_results = []
+    try:
+        for i, img in enumerate(images_to_process):
+            img = img.convert("RGB")
+            try:
+                print(f"Processing page/image {i+1} with {model_choice}...")
+                if model_choice == DEEPSEEK_MODEL:
+                    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
+                        img.save(tmp.name)
+                        tmp_path = tmp.name
+                    try:
+                        with torch.no_grad():
+                            res = model.infer(
+                                processor_or_tokenizer,
+                                prompt=custom_prompt if custom_prompt else "<image>\nFree OCR. ",
+                                image_file=tmp_path,
+                                output_path=output_dir,
+                                base_size=1024,
+                                image_size=768,
+                                crop_mode=True,
+                                eval_mode=True
+                            )
+                        all_results.append(f"--- Page/Image {i+1} ---\n{res}")
+                    finally:
+                        if os.path.exists(tmp_path):
+                            os.remove(tmp_path)
+                elif model_choice == MEDGEMMA_MODEL:
+                    prompt_text = custom_prompt if custom_prompt else "extract all text from image"
+                    messages = [
+                        {
+                            "role": "user",
+                            "content": [
+                                {"type": "image", "image": img},
+                                {"type": "text", "text": prompt_text}
+                            ]
+                        }
+                    ]
+                    inputs = processor_or_tokenizer.apply_chat_template(
+                        messages,
+                        add_generation_prompt=True,
+                        tokenize=True,
+                        return_dict=True,
+                        return_tensors="pt"
+                    ).to("cuda") # Ensure inputs are on cuda
+                    with torch.no_grad():
+                        output = model.generate(**inputs, max_new_tokens=4096)
+                    input_len = inputs["input_ids"].shape[-1]
+                    res = processor_or_tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
+                    all_results.append(f"--- Page/Image {i+1} ---\n{res}")
+            except Exception as e:
+                all_results.append(f"--- Page/Image {i+1} ---\nПомилка: {str(e)}")
+    finally:
+        # Move back to CPU and clean up to free ZeroGPU resources
+        print(f"Moving {model_choice} back to CPU...")
+        model.to("cpu")
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        gc.collect()
+    return "\n\n".join(all_results)
+def save_result_to_file(text):
+    if not text or text.startswith("Будь ласка") or text.startswith("Помилка"):
+        return None
+    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+    filename = f"ocr_result_{timestamp}.txt"
+    os.makedirs("outputs", exist_ok=True)
+    filepath = os.path.abspath(os.path.join("outputs", filename))
+    with open(filepath, "w", encoding="utf-8") as f:
+        f.write(text)
+    return filepath
+custom_css = """
+.header { text-align: center; margin-bottom: 30px; }
+.header h1 { font-size: 2.5rem; }
+.footer { text-align: center; margin-top: 50px; font-size: 0.9rem; color: #718096; }
+"""
+with gr.Blocks(title="OCR Comparison: DeepSeek vs MedGemma", css=custom_css) as demo:
+    with gr.Column():
+        gr.Markdown("# 🔍 OCR & Medical Document Analysis", elem_classes="header")
+        gr.Markdown("Порівняння DeepSeek-OCR-2 та MedGemma-1.5-4B (HuggingFace ZeroGPU Edition)", elem_classes="header")
+        with gr.Row():
+            with gr.Column(scale=1):
+                with gr.Tab("Зображення"):
+                    input_img = gr.Image(type="pil", label="Перетягніть зображення")
+                with gr.Tab("PDF / Файли"):
+                    input_file = gr.File(label="Завантажте PDF або інший файл")
+                model_selector = gr.Dropdown(
+                    choices=[DEEPSEEK_MODEL, MEDGEMMA_MODEL],
+                    value=DEEPSEEK_MODEL,
+                    label="Оберіть модель"
+                )
+                with gr.Accordion("Налаштування", open=False):
+                    prompt_input = gr.Textbox(
+                        value="",
+                        label="Користувацький промпт (залиште порожнім для дефолтного)",
+                        placeholder="Наприклад: Extract all text from image"
+                    )
+                with gr.Row():
+                    clear_btn = gr.Button("Очистити", variant="secondary")
+                    ocr_btn = gr.Button("Запустити аналіз", variant="primary")
+            with gr.Column(scale=1):
+                output_text = gr.Textbox(
+                    label="Результат",
+                    lines=20
+                )
+                with gr.Row():
+                    save_btn = gr.Button("Зберегти у файл 💾")
+                    download_file = gr.File(label="Завантажити результат")
+        gr.Markdown("---")
+        gr.Markdown("### Як використовувати:\n1. Завантажте зображення або PDF.\n2. Виберіть модель.\n3. Натисніть 'Запустити аналіз'.\n*Примітка: MedGemma потребує HF_TOKEN з доступом до моделі.*")
+    # Event handlers
+    ocr_btn.click(
+        fn=run_ocr,
+        inputs=[input_img, input_file, model_selector, prompt_input],
+        outputs=output_text
+    )
+    save_btn.click(
+        fn=save_result_to_file,
+        inputs=output_text,
+        outputs=download_file
+    )
+    def clear_all():
+        return None, None, ""
+    clear_btn.click(
+        fn=clear_all,
+        inputs=None,
+        outputs=[input_img, input_file, output_text]
+    )
+if __name__ == "__main__":
+    demo.queue().launch()

compare_models.py ADDED

	@@ -0,0 +1,105 @@

+import torch
+from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
+from PIL import Image
+import fitz
+import os
+import time
+DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
+MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'
+if torch.backends.mps.is_available():
+    print("Patching torch for MPS compatibility...")
+    device = "mps"
+    torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("mps")
+    torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("mps")
+    torch.bfloat16 = torch.float16
+    dtype = torch.float16
+else:
+    device = "cpu"
+    dtype = torch.float32
+def get_page_image(pdf_path, page_num=0):
+    doc = fitz.open(pdf_path)
+    page = doc.load_page(page_num)
+    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
+    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+    doc.close()
+    return img
+def run_deepseek(img):
+    tokenizer = AutoTokenizer.from_pretrained(DEEPSEEK_MODEL, trust_remote_code=True)
+    model = AutoModel.from_pretrained(DEEPSEEK_MODEL, trust_remote_code=True, use_safetensors=True)
+    model = model.to(device=device, dtype=dtype).eval()
+    os.makedirs("outputs", exist_ok=True)
+    # DeepSeek's .infer() reads the image from disk, so write a temp file first
+    img.save("temp_comp.png")
+    try:
+        with torch.no_grad():
+            res = model.infer(
+                tokenizer,
+                prompt="<image>\nFree OCR. ",
+                image_file="temp_comp.png",
+                output_path="outputs",
+                base_size=1024,
+                image_size=768,
+                crop_mode=True,
+                eval_mode=True
+            )
+    finally:
+        # Remove the temp file even if inference fails
+        os.remove("temp_comp.png")
+    return res
+def run_medgemma(img):
+    processor = AutoProcessor.from_pretrained(MEDGEMMA_MODEL)
+    model = AutoModelForImageTextToText.from_pretrained(
+        MEDGEMMA_MODEL,
+        trust_remote_code=True,
+        dtype=dtype if device == "mps" else torch.float32,
+        device_map="auto" if device != "mps" else None
+    ).eval()
+    if device == "mps":
+        model = model.to("mps")
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image", "image": img},
+                {"type": "text", "text": "Extract all text from this medical document."}
+            ]
+        }
+    ]
+    inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
+    with torch.no_grad():
+        output = model.generate(**inputs, max_new_tokens=2048)
+    input_len = inputs["input_ids"].shape[-1]
+    return processor.decode(output[0][input_len:], skip_special_tokens=True)
+def compare():
+    pdf_path = "doc_for_testing/pdf12_un.pdf"
+    if not os.path.exists(pdf_path):
+        print("PDF not found.")
+        return
+    img = get_page_image(pdf_path)
+    print("\n--- Running DeepSeek-OCR-2 ---")
+    start = time.time()
+    ds_res = run_deepseek(img)
+    print(f"Time: {time.time() - start:.2f}s")
+    print("\n--- Running MedGemma-1.5-4B ---")
+    start = time.time()
+    mg_res = run_medgemma(img)
+    print(f"Time: {time.time() - start:.2f}s")
+    with open("model_comparison.md", "w", encoding="utf-8") as f:
+        f.write("# Comparison Report: DeepSeek-OCR-2 vs MedGemma-1.5-4B\n\n")
+        f.write("## DeepSeek-OCR-2 Result\n\n")
+        f.write(ds_res + "\n\n")
+        f.write("## MedGemma-1.5-4B Result\n\n")
+        f.write(mg_res + "\n")
+if __name__ == "__main__":
+    compare()
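
`compare()` times each model with paired `time.time()` calls. The same measurement can be sketched as a reusable context manager; `timed` and the `results` dict are hypothetical names, and `time.perf_counter()` is swapped in here because it is monotonic and intended for measuring intervals:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the block raises.
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    timings = {}
    with timed("demo", timings):
        sum(range(100_000))  # stand-in for a model call
    print(f"demo: {timings['demo']:.4f}s")
```

Note that in `compare_models.py` the measured span also covers model loading, because `run_deepseek` and `run_medgemma` load their models inside the timed call.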

convert_docs.py ADDED

	@@ -0,0 +1,31 @@

+import fitz  # PyMuPDF
+import os
+from PIL import Image
+def convert_pdf_to_images(pdf_path, output_dir):
+    os.makedirs(output_dir, exist_ok=True)
+    doc = fitz.open(pdf_path)
+    # splitext keeps dots inside the file name (e.g. "report.v2.pdf" -> "report.v2")
+    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
+    image_paths = []
+    # Just take the first page for testing to save time/memory
+    for i in range(min(1, len(doc))):
+        page = doc.load_page(i)
+        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2)) # Zoom for better OCR
+        output_file = os.path.join(output_dir, f"{base_name}_page_{i+1}.png")
+        # Convert to PIL Image
+        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+        img.save(output_file)
+        image_paths.append(output_file)
+        print(f"Converted {pdf_path} page {i+1} to {output_file}")
+    doc.close()
+    return image_paths
+if __name__ == "__main__":
+    pdf_dir = "doc_for_testing"
+    output_dir = "doc_images"
+    for filename in os.listdir(pdf_dir):
+        if filename.endswith(".pdf"):
+            convert_pdf_to_images(os.path.join(pdf_dir, filename), output_dir)
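The output file name in `convert_pdf_to_images` is built from the PDF's base name plus a one-based page number. A small sketch of that naming step in isolation, assuming dotted file names such as `report.v2.pdf` should keep their full stem; `page_image_name` is a hypothetical helper, not part of this commit.

```python
import os


def page_image_name(pdf_path: str, page_index: int) -> str:
    """Build '<stem>_page_<n>.png' from a PDF path and a zero-based page index."""
    # splitext removes only the final extension, so 'report.v2.pdf' -> 'report.v2'.
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}_page_{page_index + 1}.png"


if __name__ == "__main__":
    print(page_image_name("doc_for_testing/pdf12_un.pdf", 0))
    print(page_image_name("doc_for_testing/report.v2.pdf", 2))
```

Splitting on the first `.` instead would turn `report.v2.pdf` and `report.v3.pdf` into the same stem, so their page images would overwrite each other.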

convert_full_pdf.py ADDED

	@@ -0,0 +1,32 @@

+import fitz  # PyMuPDF
+import os
+from PIL import Image
+def convert_full_pdf_to_images(pdf_path, output_dir):
+    # Create the output directory if it is missing (no error if it already exists)
+    os.makedirs(output_dir, exist_ok=True)
+    doc = fitz.open(pdf_path)
+    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
+    image_paths = []
+    print(f"Converting all {len(doc)} pages of {pdf_path}...")
+    for i in range(len(doc)):
+        page = doc.load_page(i)
+        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2)) # Zoom for better OCR
+        output_file = os.path.join(output_dir, f"{base_name}_page_{i+1}.png")
+        # Convert to PIL Image
+        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
+        img.save(output_file)
+        image_paths.append(output_file)
+        print(f"Converted page {i+1}/{len(doc)}")
+    return image_paths
+if __name__ == "__main__":
+    pdf_file = "doc_for_testing/pdf12_un.pdf"
+    output_dir = "doc_images_full"
+    if os.path.exists(pdf_file):
+        convert_full_pdf_to_images(pdf_file, output_dir)
+    else:
+        print(f"File {pdf_file} not found.")
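`convert_docs.py` and `convert_full_pdf.py` differ mainly in which pages they loop over: `range(min(1, len(doc)))` versus `range(len(doc))`. A sketch of how a single page-selection function could cover both cases; `page_indices` and its `max_pages` parameter are hypothetical and not part of this commit.

```python
from typing import Optional


def page_indices(total_pages: int, max_pages: Optional[int] = None) -> range:
    """Return the zero-based page indices to convert.

    max_pages=None means every page; otherwise at most max_pages pages.
    """
    if max_pages is None:
        return range(total_pages)
    # Clamp to [0, total_pages] so empty documents and negative limits are safe.
    return range(min(max(max_pages, 0), total_pages))


if __name__ == "__main__":
    print(list(page_indices(13, max_pages=1)))  # first page only
    print(len(page_indices(13)))                # whole document
```

A shared conversion function could then take `max_pages` as an argument instead of the logic living in two near-identical scripts.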

generate_test_image.py ADDED

	@@ -0,0 +1,21 @@

+from PIL import Image, ImageDraw, ImageFont
+import os
+def create_sample_image(text, filename):
+    # Create a white image
+    img = Image.new('RGB', (800, 400), color=(255, 255, 255))
+    d = ImageDraw.Draw(img)
+    # Try a TrueType font first; fall back to Pillow's built-in font if it is unavailable
+    try:
+        # This font path exists on macOS only
+        font = ImageFont.truetype("/System/Library/Fonts/Supplemental/Arial.ttf", 40)
+    except OSError:
+        font = ImageFont.load_default()
+    d.text((100, 150), text, fill=(0, 0, 0), font=font)
+    img.save(filename)
+    print(f"Created {filename}")
+if __name__ == "__main__":
+    create_sample_image("DeepSeek-OCR-2 test. Hello World!", "sample_test.png")

ocr_full_pdf12.py ADDED

	@@ -0,0 +1,89 @@

+from transformers import AutoModel, AutoTokenizer
+import torch
+import os
+from PIL import Image
+import time
+# Force CPU for stability on Mac
+device = "cpu"
+print(f"Using device: {device}")
+# Patch to avoid CUDA calls in custom code
+torch.Tensor.cuda = lambda self, *args, **kwargs: self.to(device)
+torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to(device)
+model_name = 'deepseek-ai/DeepSeek-OCR-2'
+def ocr_full_document():
+    print("Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    print("Loading model...")
+    model = AutoModel.from_pretrained(
+        model_name,
+        trust_remote_code=True,
+        use_safetensors=True
+    )
+    model = model.eval()
+    # Process-wide hack: alias bfloat16 to float32 so the model's custom code runs on CPU
+    torch.bfloat16 = torch.float32
+    image_dir = "doc_images_full"
+    output_dir = "ocr_results_pdf12"
+    os.makedirs(output_dir, exist_ok=True)
+    # Get images sorted by page number
+    import re
+    def get_page_num(filename):
+        match = re.search(r'page_(\d+)', filename)
+        return int(match.group(1)) if match else 0
+    images = sorted([f for f in os.listdir(image_dir) if f.endswith(".png")], key=get_page_num)
+    full_markdown = []
+    for i, img_name in enumerate(images):
+        img_path = os.path.join(image_dir, img_name)
+        print(f"\n[{i+1}/{len(images)}] Processing page {get_page_num(img_name)}...")
+        prompt = "<image>\nFree OCR. "
+        start_time = time.time()
+        try:
+            with torch.no_grad():
+                res = model.infer(
+                    tokenizer,
+                    prompt=prompt,
+                    image_file=img_path,
+                    output_path=output_dir,
+                    base_size=1024,
+                    image_size=768,
+                    crop_mode=False,
+                    eval_mode=True
+                )
+            elapsed = time.time() - start_time
+            print(f"  Done in {elapsed:.2f}s")
+            # Save individual page result
+            page_file = os.path.join(output_dir, f"{img_name}.md")
+            with open(page_file, "w", encoding="utf-8") as f:
+                f.write(res)
+            full_markdown.append(f"## Page {get_page_num(img_name)}\n\n{res}\n\n---\n")
+        except Exception as e:
+            print(f"  Failed: {e}")
+            full_markdown.append(f"## Page {get_page_num(img_name)}\n\n[OCR FAILED]\n\n---\n")
+    # Save combined result
+    combined_file = os.path.join(output_dir, "full_document.md")
+    with open(combined_file, "w", encoding="utf-8") as f:
+        f.write("# OCR Result for pdf12_un.pdf\n\n")
+        f.write("".join(full_markdown))
+    print(f"\nCompleted! Full result saved to: {combined_file}")
+if __name__ == "__main__":
+    ocr_full_document()
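`ocr_full_document` sorts page images by the number in `page_<n>` instead of by file name. A short standalone sketch of why that matters, using made-up file names in the same `<stem>_page_<n>.png` pattern; `sort_by_page` is a hypothetical wrapper around the same regex approach.

```python
import re
from typing import List


def page_number(filename: str) -> int:
    """Extract the integer after 'page_'; files without one sort first as 0."""
    match = re.search(r"page_(\d+)", filename)
    return int(match.group(1)) if match else 0


def sort_by_page(filenames: List[str]) -> List[str]:
    """Sort file names by their numeric page index."""
    return sorted(filenames, key=page_number)


if __name__ == "__main__":
    names = ["pdf12_un_page_10.png", "pdf12_un_page_2.png", "pdf12_un_page_1.png"]
    print(sorted(names))        # lexicographic: page_10 lands before page_2
    print(sort_by_page(names))  # numeric: 1, 2, 10
```

For a 13-page document, a plain `sorted()` would place pages 10 to 13 between pages 1 and 2 in the combined Markdown file.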

requirements.txt ADDED

	@@ -0,0 +1,16 @@

+torch
+transformers>=4.45.0
+tokenizers
+einops
+addict
+easydict
+accelerate
+sentencepiece
+pillow
+matplotlib
+requests
+torchvision
+gradio
+pymupdf
+spaces
+huggingface-hub

test_inference.py ADDED

	@@ -0,0 +1,66 @@

+from transformers import AutoModel, AutoTokenizer
+import torch
+import torch.nn as nn
+import os
+from PIL import Image, ImageOps
+import math
+# Force CPU
+device = "cpu"
+dtype = torch.float32
+print(f"Forcing device: {device} with dtype: {dtype}")
+# Patch torch types to avoid mixed-precision errors in the model's custom code
+torch.bfloat16 = torch.float32  # Force bfloat16 to float32
+torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("cpu")
+torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("cpu")
+model_name = 'deepseek-ai/DeepSeek-OCR-2'
+def test_inference():
+    print(f"Loading tokenizer for {model_name}...")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    print(f"Loading model for {model_name}...")
+    model = AutoModel.from_pretrained(
+        model_name,
+        trust_remote_code=True,
+        use_safetensors=True,
+        torch_dtype=torch.float32 # Explicitly float32
+    )
+    model = model.eval() # Already on CPU by default if no device_map
+    output_dir = 'outputs'
+    os.makedirs(output_dir, exist_ok=True)
+    prompt = "<image>\nFree OCR. "
+    image_file = 'sample_test.png'
+    if not os.path.exists(image_file):
+        print(f"Error: {image_file} not found.")
+        return
+    print("Running inference on CPU...")
+    try:
+        with torch.no_grad():
+            res = model.infer(
+                tokenizer,
+                prompt=prompt,
+                image_file=image_file,
+                output_path=output_dir,
+                base_size=512,
+                image_size=384,
+                crop_mode=False,
+                eval_mode=True
+            )
+        print("\n--- OCR Result ---")
+        print(res)
+        print("------------------")
+    except Exception as e:
+        print(f"Inference failed: {e}")
+        import traceback
+        traceback.print_exc()
+if __name__ == "__main__":
+    test_inference()
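The scripts in this commit rebind `torch.bfloat16` and the `.cuda()` methods for the whole process. If the override is only needed while inference runs, it could be scoped and undone afterwards. A minimal sketch of that idea, demonstrated on a stand-in object instead of `torch`; `temporary_attr` is a hypothetical helper, and whether the model's custom code behaves correctly under a scoped patch has not been verified here.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Any, Iterator


@contextmanager
def temporary_attr(obj: Any, name: str, value: Any) -> Iterator[None]:
    """Set obj.<name> to value inside the with-block, then restore the original."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        # Restore even if the body raised.
        setattr(obj, name, original)


if __name__ == "__main__":
    # fake_torch stands in for the torch module in this illustration.
    fake_torch = SimpleNamespace(bfloat16="bfloat16", float32="float32")
    with temporary_attr(fake_torch, "bfloat16", fake_torch.float32):
        print(fake_torch.bfloat16)  # patched value inside the block
    print(fake_torch.bfloat16)      # original value restored
```

The standard library's `unittest.mock.patch.object` offers the same set-and-restore behavior and could be used in place of a hand-written context manager.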

test_medgemma.py ADDED

	@@ -0,0 +1,64 @@

+from transformers import AutoProcessor, AutoModelForImageTextToText
+import torch
+from PIL import Image
+import os
+model_id = "google/medgemma-1.5-4b-it"
+def test_medgemma():
+    print(f"Loading {model_id}...")
+    try:
+        processor = AutoProcessor.from_pretrained(model_id)
+        # We try to load without device_map="auto" for MPS or manual device control
+        model = AutoModelForImageTextToText.from_pretrained(
+            model_id,
+            torch_dtype=torch.float32, # CPU usually stable with float32
+            trust_remote_code=True
+        ).eval()
+        print("Model loaded.")
+        image_path = "sample_test.png"
+        if not os.path.exists(image_path):
+            print("No test image found.")
+            return
+        image = Image.open(image_path).convert("RGB")
+        # Use chat template as suggested
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image", "image": image},
+                    {"type": "text", "text": "Extract all text from this image."}
+                ]
+            }
+        ]
+        inputs = processor.apply_chat_template(
+            messages,
+            add_generation_prompt=True,
+            tokenize=True,
+            return_dict=True,
+            return_tensors="pt"
+        )
+        print("Running inference...")
+        with torch.no_grad():
+            output = model.generate(**inputs, max_new_tokens=100)
+        input_len = inputs["input_ids"].shape[-1]
+        result = processor.decode(output[0][input_len:], skip_special_tokens=True)
+        print("\n--- MedGemma Result ---")
+        print(result)
+        print("-----------------------")
+    except Exception as e:
+        print(f"Error: {e}")
+        import traceback
+        traceback.print_exc()
+if __name__ == "__main__":
+    test_medgemma()
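`test_medgemma` decodes `output[0][input_len:]` and not the whole output, because for decoder-only generation the returned sequence starts with the prompt tokens and the new tokens follow. A toy sketch of that slicing step with plain integer lists instead of tensors; `strip_prompt` and the token ids are made up for illustration.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Drop the first prompt_len ids so only newly generated ids remain."""
    return list(output_ids[prompt_len:])


if __name__ == "__main__":
    prompt_ids = [101, 7, 8, 9]            # made-up prompt token ids
    output_ids = prompt_ids + [42, 43, 2]  # prompt echoed, then new tokens
    print(strip_prompt(output_ids, len(prompt_ids)))
```

Without the slice, the decoded string would repeat the chat prompt before the extracted text.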

test_minimal.py ADDED

	@@ -0,0 +1,31 @@

+from transformers import AutoModel, AutoTokenizer
+import torch
+import os
+from PIL import Image
+model_name = 'deepseek-ai/DeepSeek-OCR-2'
+def test_inference():
+    print(f"Loading tokenizer for {model_name}...")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    print(f"Loading model for {model_name}...")
+    # Load model on CPU
+    model = AutoModel.from_pretrained(
+        model_name,
+        trust_remote_code=True,
+        use_safetensors=True
+    )
+    # Check if loaded
+    print("Model loaded successfully.")
+    print(f"Model type: {type(model)}")
+    # Test simple tokenization
+    inputs = tokenizer("Hello", return_tensors="pt")
+    print("Tokenizer test: Success")
+    print("DeepSeek-OCR-2 is ready for use.")
+if __name__ == "__main__":
+    test_inference()

test_real_docs.py ADDED

	@@ -0,0 +1,82 @@

+from transformers import AutoModel, AutoTokenizer
+import torch
+import os
+from PIL import Image
+import time
+# Force CPU for stability
+device = "cpu"
+print(f"Using device: {device}")
+# Patch to avoid CUDA calls in custom code
+torch.Tensor.cuda = lambda self, *args, **kwargs: self.to(device)
+torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to(device)
+model_name = 'deepseek-ai/DeepSeek-OCR-2'
+def test_docs():
+    print("Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+    print("Loading model (may take a minute)...")
+    # Load with default parameters that worked in test_minimal.py
+    model = AutoModel.from_pretrained(
+        model_name,
+        trust_remote_code=True,
+        use_safetensors=True
+    )
+    model = model.eval()
+    # After loading, we monkeypatch bfloat16 for the inference logic
+    torch.bfloat16 = torch.float32
+    image_dir = "doc_images"
+    output_dir = "ocr_results"
+    os.makedirs(output_dir, exist_ok=True)
+    images = sorted([f for f in os.listdir(image_dir) if f.endswith(".png")])
+    for img_name in images:
+        img_path = os.path.join(image_dir, img_name)
+        print(f"\n--- Processing: {img_name} ---")
+        # DeepSeek-OCR-2 needs specific ratios for its hardcoded query embeddings
+        # base_size=1024 -> n_query=256 (supported)
+        # image_size=768 -> n_query=144 (supported)
+        prompt = "<image>\nFree OCR. "
+        start_time = time.time()
+        try:
+            with torch.no_grad():
+                res = model.infer(
+                    tokenizer,
+                    prompt=prompt,
+                    image_file=img_path,
+                    output_path=output_dir,
+                    base_size=1024, # Must be 1024 for 256 queries
+                    image_size=768,  # Must be 768 for 144 queries
+                    crop_mode=False,
+                    eval_mode=True
+                )
+            elapsed = time.time() - start_time
+            print(f"Done in {elapsed:.2f}s")
+            result_file = os.path.join(output_dir, f"{img_name}.md")
+            with open(result_file, "w", encoding="utf-8") as f:
+                f.write(res)
+            print(f"Result saved to {result_file}")
+            print("Preview (first 500 chars):")
+            print("-" * 20)
+            print(res[:500] + "...")
+            print("-" * 20)
+        except Exception as e:
+            print(f"Inference failed for {img_name}: {e}")
+            import traceback
+            traceback.print_exc()
+if __name__ == "__main__":
+    test_docs()
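The comments in `test_docs` pair `base_size=1024` with 256 queries and `image_size=768` with 144 queries. Both pairs are consistent with a square grid of `size // 64` cells per side. The sketch below only reproduces that arithmetic; the divisor 64 and the function `n_query_for` are assumptions inferred from those two data points, not taken from the model's code.

```python
def n_query_for(size: int, cell: int = 64) -> int:
    """Return (size // cell) ** 2, the query count implied by the comments above.

    cell=64 is an assumption: it is the value that maps 1024 -> 256 and 768 -> 144.
    """
    if size <= 0 or size % cell != 0:
        raise ValueError(f"size must be a positive multiple of {cell}, got {size}")
    side = size // cell
    return side * side


if __name__ == "__main__":
    for size in (1024, 768):
        print(size, n_query_for(size))
```

Under this assumed rule, other sizes would map to other query counts, which the comments suggest the model may not support; checking the model's own configuration is the reliable way to confirm which sizes are valid.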