Spaces: Running on Zero
Commit · b752d16 · 0 parent(s)
Initial commit: DeepSeek-OCR-2 & MedGemma-1.5 multimodal analysis app with ZeroGPU support
- .gitignore +32 -0
- OCR_ANALYSIS_REPORT.md +57 -0
- README.md +60 -0
- README_HF.md +31 -0
- app.py +264 -0
- app_hf.py +260 -0
- compare_models.py +105 -0
- convert_docs.py +31 -0
- convert_full_pdf.py +32 -0
- generate_test_image.py +21 -0
- ocr_full_pdf12.py +89 -0
- requirements.txt +16 -0
- test_inference.py +66 -0
- test_medgemma.py +64 -0
- test_minimal.py +31 -0
- test_real_docs.py +82 -0
.gitignore
ADDED
@@ -0,0 +1,32 @@
# Virtual environment
venv/
.venv/
env/

# Data and Results
doc_for_testing/
doc_images/
doc_images_full/
ocr_results/
ocr_results_package/
ocr_results_pdf12/
outputs/

# Temporary and generated files
*.zip
*.jpg
*.png
*.pdf
# Keep sample_test.png: needed for the examples in the app and README
!sample_test.png
temp_comp.png
ocr_result_*.txt

# Python cache
__pycache__/
*.py[cod]
*$py.class

# IDEs
.vscode/
.idea/
.DS_Store
OCR_ANALYSIS_REPORT.md
ADDED
@@ -0,0 +1,57 @@
# DeepSeek-OCR-2 Accuracy and Performance Analysis

**Date:** January 28, 2026
**Test file:** `doc_for_testing/pdf12_un.pdf` (13 pages)
**Environment:** Apple M3 Max (CPU inference, float32)

---

## 1. Accuracy Analysis

**Overall score:** 8/10

The model shows a strong grasp of document context and structure, but exhibits problems typical of Large Language Models (LLMs).

### ✅ Strengths
* **Deep contextual understanding:** The model reliably distinguishes document sections ("Impression", "Plan", "Vitals"). The Markdown output is clean and ready to use.
* **Medical terminology:** Domain-specific terms are recognized almost flawlessly (e.g., *Gastroesophageal reflux disease*, *Cholecystectomy*, *Tissue Transglutaminase*).
* **Table handling:** The model correctly converts visual tables into Markdown tables, preserving the logical relationships between values.
* **Noise robustness:** It copes well with varied fonts and formatting.

### ⚠️ Critical issues (Weaknesses)
* **Hallucinated proper nouns:** This is the most serious problem. The model tends to "invent" brand or organization names when the text is blurry or a logo is complex.
  * *Atrium Health* → recognized as **"Arthur Health"**.
  * *Carolina Imaging Services* → recognized as **"Carlos Alings Ingegvers"**.
* **Minor recognition errors:**
  * *Post-menopausal* → **"Pilot-menopausal"**.
  * Duplicated answers in checklists (e.g., "No No" instead of "No").

---

## 2. Performance Analysis

**Overall score (CPU):** 6/10

Speed was measured on CPU because MPS (Metal Performance Shaders) support is limited for the MoE (Mixture of Experts) layers in the current DeepSeek code.

* **Average time per page:** ~19-20 seconds.
  * *Fastest:* ~7.4 s (pages with little text).
  * *Slowest:* ~29 s (dense pages).
* **Full run (13 pages):** ~4.5-5 minutes.

**Speed verdict:** On CPU the model is suitable only for background batch processing. It is too slow for interactive (real-time) use.

---

## 3. Recommendations

### To improve accuracy:
1. **Post-processing:** Introduce a validation dictionary for critical known entities, e.g., automatically replacing "Arthur Health" with "Atrium Health" based on a list of known clinics.
2. **Hybrid approach:** Use a classical OCR engine (e.g., Tesseract or PaddleOCR) to extract exact names ("raw text"), and use DeepSeek-OCR-2 for structuring and semantic understanding.

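The known-entity validation idea can be sketched with only the standard library. The entity list, the `difflib`-based fuzzy matching, and the 0.6 similarity cutoff are illustrative assumptions, not part of this project:

```python
import difflib

# Illustrative list; in practice this would come from a curated
# registry of known clinic and vendor names.
KNOWN_ENTITIES = ["Atrium Health", "Carolina Imaging Services"]

def correct_entities(text: str, known: list, cutoff: float = 0.6) -> str:
    """Replace hallucinated entity names with the closest known entity.

    Scans the text for word spans of the same length as each known
    entity and substitutes them when fuzzy similarity exceeds `cutoff`.
    """
    words = text.split()
    for entity in known:
        n = len(entity.split())
        for i in range(len(words) - n + 1):
            span = " ".join(words[i:i + n])
            ratio = difflib.SequenceMatcher(None, span.lower(), entity.lower()).ratio()
            if ratio >= cutoff and span != entity:
                words[i:i + n] = entity.split()
    return " ".join(words)

print(correct_entities("Seen at Arthur Health clinic", KNOWN_ENTITIES))
```

A real validator would restrict matching to contexts where an entity is expected (letterheads, footers) to avoid false substitutions in free text.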
### To improve speed:
1. **GPU inference:** Moving to an NVIDIA GPU (CUDA) is mandatory for a production environment. It should speed up processing by 10-20x (down to ~1-2 seconds per page).
2. **Quantization:** Consider 4-bit or 8-bit quantization (GGUF/AWQ) if accuracy does not degrade critically. This would significantly speed up inference even on CPU/Mac.

### Intended use:
DeepSeek-OCR-2 is well suited for **ETL pipelines** (Extract, Transform, Load) that turn unstructured PDFs/images into structured data (JSON/Markdown) for further analysis. It is less suitable for tasks that require 100% character-level accuracy with no "creativity" (e.g., recognizing codes or serial numbers).
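As a minimal illustration of that ETL framing, the Markdown the OCR emits can be folded into a JSON-ready mapping. The heading convention and the function name are assumptions made for this sketch, not project code:

```python
import json
import re

def markdown_to_records(md: str) -> dict:
    """Group OCR Markdown output into a {section heading: body text} mapping."""
    records = {}
    current = "preamble"
    for line in md.splitlines():
        heading = re.match(r"#{1,6}\s+(.+)", line)
        if heading:
            current = heading.group(1).strip()
            records[current] = []
        else:
            records.setdefault(current, []).append(line)
    return {k: "\n".join(v).strip() for k, v in records.items()}

page = "## Impression\nNo acute findings.\n\n## Plan\nFollow-up in 6 months."
print(json.dumps(markdown_to_records(page), indent=2))
```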
README.md
ADDED
@@ -0,0 +1,60 @@
# DeepSeek-OCR-2 & MedGemma-1.5 Multimodal Analysis

This project analyzes medical and general documents using modern multimodal models: **DeepSeek-OCR-2** and **MedGemma-1.5-4B-IT**.

## 🚀 Key features

- **DeepSeek-OCR-2**: High-accuracy text recognition (OCR) built on a Mixture-of-Experts (MoE) architecture.
- **MedGemma-1.5-4B-IT**: A multimodal model from Google specialized for medical images and text (Gemma 3 / PaliGemma architecture).
- **Gradio web UI**: Convenient image/PDF upload, model selection, and result visualization.
- **Model comparison**: A dedicated tool that analyzes the same page with both models at once.
- **Mac optimizations**: Patches for MPS (Metal Performance Shaders) support and compatibility fixes for newer `transformers` versions.

## 📦 Project layout

- `app.py`: Main application with the Gradio interface.
- `compare_models.py`: Script for comparative analysis of DeepSeek and MedGemma.
- `test_medgemma.py`: Test script verifying that MedGemma works.
- `outputs/`: Directory where analysis results are saved.
- `venv/`: Python 3.11.9 virtual environment.

## 🛠 Setup

### 1. Prepare the environment
```bash
# Activate the virtual environment
source venv/bin/activate

# Install the required libraries (if an update is needed)
pip install -r requirements.txt
```

### 2. MedGemma access
To use `google/medgemma-1.5-4b-it` you need to:
1. Have a Hugging Face account.
2. Accept the model's terms of use on its [model page](https://huggingface.co/google/medgemma-1.5-4b-it).
3. Log in locally: `huggingface-cli login`.

## 🖥 How to run

### Launch the web UI
```bash
python app.py
```
After launch, open the link in your browser (usually `http://127.0.0.1:7860`).

### Compare results
```bash
python compare_models.py
```
The result is saved to `model_comparison.md`.

## 🍎 macOS notes (M1/M2/M3)

The project includes automatic fixes (monkeypatching) for:
1. **Transformers 5.0 compatibility**: Fixes `LlamaFlashAttention2` and `DynamicCache` import errors.
2. **MPS acceleration**: Uses the Mac GPU automatically where possible (float16).
3. **MoE on CPU**: Because DeepSeek MoE has limited MPS support, some of its parts are automatically moved to CPU for stability.

---
*This project was built to test and demonstrate the capabilities of modern LLMs for medical document recognition.*
README_HF.md
ADDED
@@ -0,0 +1,31 @@
---
title: Local OCR Demo
emoji: 🔍
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app_hf.py
pinned: false
license: apache-2.0
---

# 🔍 OCR & Medical Document Analysis

A comparison of DeepSeek-OCR-2 and MedGemma-1.5-4B (Hugging Face ZeroGPU Edition).

## 🚀 Key features

- **DeepSeek-OCR-2**: MoE-based OCR.
- **MedGemma-1.5-4B-IT**: Google's medical multimodal model.
- **ZeroGPU support**: Runs on powerful GPUs in the Hugging Face cloud.

## 🛠 Setup on Hugging Face Spaces

1. Create a new Space with the **Gradio** SDK.
2. Select the **ZeroGPU** hardware type.
3. Add `HF_TOKEN` under **Settings -> Variables and secrets** if you plan to use MedGemma.
4. Copy the contents of `app_hf.py` (as `app.py`) and `requirements.txt`.

## 📦 Dependencies
All required libraries are listed in `requirements.txt`.
app.py
ADDED
@@ -0,0 +1,264 @@
import gradio as gr
from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
import torch
import os
from PIL import Image
import tempfile
import datetime
import fitz  # PyMuPDF
import io
import gc

# --- Configuration ---
DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'

# --- Device Setup ---
if torch.backends.mps.is_available():
    print("Using MPS device")
    device = "mps"
    # Patch for DeepSeek custom code which uses .cuda()
    torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("mps")
    torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("mps")
    dtype = torch.float16
else:
    device = "cpu"
    dtype = torch.float32

class ModelManager:
    def __init__(self):
        self.current_model_name = None
        self.model = None
        self.processor = None
        self.tokenizer = None

    def unload_current_model(self):
        if self.model is not None:
            print(f"Unloading {self.current_model_name}...")
            del self.model
            del self.processor
            del self.tokenizer
            self.model = None
            self.processor = None
            self.tokenizer = None
            self.current_model_name = None
            if torch.backends.mps.is_available():
                torch.mps.empty_cache()
            gc.collect()

    def load_model(self, model_name):
        if self.current_model_name == model_name:
            return self.model, self.processor or self.tokenizer

        self.unload_current_model()

        print(f"Loading {model_name}...")
        if model_name == DEEPSEEK_MODEL:
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
            self.model = AutoModel.from_pretrained(
                model_name,
                trust_remote_code=True,
                use_safetensors=True
            )
            self.model = self.model.to(device=device, dtype=dtype)
            self.model.eval()
            self.current_model_name = model_name
            return self.model, self.tokenizer

        elif model_name == MEDGEMMA_MODEL:
            self.processor = AutoProcessor.from_pretrained(model_name)
            self.model = AutoModelForImageTextToText.from_pretrained(
                model_name,
                trust_remote_code=True,
                torch_dtype=dtype if device == "mps" else torch.float32,
                device_map="auto" if device != "mps" else None
            )
            if device == "mps":
                self.model = self.model.to("mps")
            self.model.eval()
            self.current_model_name = model_name
            return self.model, self.processor

manager = ModelManager()

def pdf_to_images(pdf_path):
    doc = fitz.open(pdf_path)
    images = []
    for page in doc:
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img_data = pix.tobytes("png")
        img = Image.open(io.BytesIO(img_data))
        images.append(img)
    doc.close()
    return images

def run_ocr(input_image, input_file, model_choice, custom_prompt):
    images_to_process = []

    if input_file is not None:
        if input_file.name.lower().endswith(".pdf"):
            try:
                images_to_process = pdf_to_images(input_file.name)
            except Exception as e:
                return f"Error reading PDF: {str(e)}"
        else:
            try:
                images_to_process = [Image.open(input_file.name)]
            except Exception as e:
                return f"Error loading file: {str(e)}"
    elif input_image is not None:
        images_to_process = [input_image]
    else:
        return "Please upload an image or a PDF file."

    model, processor_or_tokenizer = manager.load_model(model_choice)

    output_dir = 'outputs'
    os.makedirs(output_dir, exist_ok=True)

    all_results = []

    for i, img in enumerate(images_to_process):
        img = img.convert("RGB")
        try:
            print(f"Processing page/image {i+1} with {model_choice}...")
            if model_choice == DEEPSEEK_MODEL:
                with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
                    img.save(tmp.name)
                    tmp_path = tmp.name

                try:
                    with torch.no_grad():
                        res = model.infer(
                            processor_or_tokenizer,
                            prompt=custom_prompt if custom_prompt else "<image>\nFree OCR. ",
                            image_file=tmp_path,
                            output_path=output_dir,
                            base_size=1024,
                            image_size=768,
                            crop_mode=True,
                            eval_mode=True
                        )
                    all_results.append(f"--- Page/Image {i+1} ---\n{res}")
                finally:
                    if os.path.exists(tmp_path):
                        os.remove(tmp_path)

            elif model_choice == MEDGEMMA_MODEL:
                prompt_text = custom_prompt if custom_prompt else "extract all text from image"
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {"type": "image", "image": img},
                            {"type": "text", "text": prompt_text}
                        ]
                    }
                ]

                inputs = processor_or_tokenizer.apply_chat_template(
                    messages,
                    add_generation_prompt=True,
                    tokenize=True,
                    return_dict=True,
                    return_tensors="pt"
                ).to(model.device)

                with torch.no_grad():
                    output = model.generate(**inputs, max_new_tokens=4096)

                input_len = inputs["input_ids"].shape[-1]
                res = processor_or_tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
                all_results.append(f"--- Page/Image {i+1} ---\n{res}")

        except Exception as e:
            all_results.append(f"--- Page/Image {i+1} ---\nError: {str(e)}")

    return "\n\n".join(all_results)

def save_result_to_file(text):
    if not text or text.startswith("Please") or text.startswith("Error"):
        return None
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"ocr_result_{timestamp}.txt"
    os.makedirs("outputs", exist_ok=True)
    filepath = os.path.abspath(os.path.join("outputs", filename))
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(text)
    return filepath

custom_css = """
.header { text-align: center; margin-bottom: 30px; }
.header h1 { font-size: 2.5rem; }
.footer { text-align: center; margin-top: 50px; font-size: 0.9rem; color: #718096; }
"""

with gr.Blocks(title="OCR Comparison: DeepSeek vs MedGemma", css=custom_css) as demo:
    with gr.Column():
        gr.Markdown("# 🔍 OCR & Medical Document Analysis", elem_classes="header")
        gr.Markdown("Comparison of DeepSeek-OCR-2 and MedGemma-1.5-4B", elem_classes="header")

        with gr.Row():
            with gr.Column(scale=1):
                with gr.Tab("Image"):
                    input_img = gr.Image(type="pil", label="Drag and drop an image")
                with gr.Tab("PDF / Files"):
                    input_file = gr.File(label="Upload a PDF or other file")

                model_selector = gr.Dropdown(
                    choices=[DEEPSEEK_MODEL, MEDGEMMA_MODEL],
                    value=DEEPSEEK_MODEL,
                    label="Select a model"
                )

                with gr.Accordion("Settings", open=False):
                    prompt_input = gr.Textbox(
                        value="",
                        label="Custom prompt (leave empty for the default)",
                        placeholder="e.g.: Extract all text from image"
                    )

                with gr.Row():
                    clear_btn = gr.Button("Clear", variant="secondary")
                    ocr_btn = gr.Button("Run analysis", variant="primary")

            with gr.Column(scale=1):
                output_text = gr.Textbox(
                    label="Result",
                    lines=20
                )

                with gr.Row():
                    save_btn = gr.Button("Save to file 💾")
                    download_file = gr.File(label="Download result")

        gr.Markdown("---")
        gr.Examples(
            examples=[["sample_test.png", None, DEEPSEEK_MODEL, ""]],
            inputs=[input_img, input_file, model_selector, prompt_input]
        )

    # Event handlers
    ocr_btn.click(
        fn=run_ocr,
        inputs=[input_img, input_file, model_selector, prompt_input],
        outputs=output_text
    )

    save_btn.click(
        fn=save_result_to_file,
        inputs=output_text,
        outputs=download_file
    )

    def clear_all():
        return None, None, ""

    clear_btn.click(
        fn=clear_all,
        inputs=None,
        outputs=[input_img, input_file, output_text]
    )

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)
app_hf.py
ADDED
@@ -0,0 +1,260 @@
| 1 |
+
import gradio as gr
|
| 2 |
+
from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
|
| 3 |
+
import torch
|
| 4 |
+
import os
|
| 5 |
+
from PIL import Image
|
| 6 |
+
import tempfile
|
| 7 |
+
import datetime
|
| 8 |
+
import fitz # PyMuPDF
|
| 9 |
+
import io
|
| 10 |
+
import gc
|
| 11 |
+
|
| 12 |
+
# Try to import spaces, if not available (local run), create a dummy decorator
|
| 13 |
+
try:
|
| 14 |
+
import spaces
|
| 15 |
+
except ImportError:
|
| 16 |
+
class spaces:
|
| 17 |
+
@staticmethod
|
| 18 |
+
def GPU(func):
|
| 19 |
+
return func
|
| 20 |
+
|
| 21 |
+
# --- Configuration ---
|
| 22 |
+
DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
|
| 23 |
+
MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'
|
| 24 |
+
|
| 25 |
+
# --- Device Setup ---
|
| 26 |
+
# For HF Spaces with ZeroGPU, we'll use cuda if available
|
| 27 |
+
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 28 |
+
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
|
| 29 |
+
|
| 30 |
+
class ModelManager:
|
| 31 |
+
def __init__(self):
|
| 32 |
+
self.models = {}
|
| 33 |
+
self.processors = {}
|
| 34 |
+
|
| 35 |
+
def get_model(self, model_name):
|
| 36 |
+
if model_name not in self.models:
|
| 37 |
+
print(f"Loading {model_name} to CPU...")
|
| 38 |
+
if model_name == DEEPSEEK_MODEL:
|
| 39 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
|
| 40 |
+
model = AutoModel.from_pretrained(
|
| 41 |
+
model_name,
|
| 42 |
+
trust_remote_code=True,
|
| 43 |
+
use_safetensors=True,
|
| 44 |
+
torch_dtype=dtype
|
| 45 |
+
)
|
| 46 |
+
model.eval()
|
| 47 |
+
self.models[model_name] = model
|
| 48 |
+
self.processors[model_name] = tokenizer
|
| 49 |
+
|
| 50 |
+
elif model_name == MEDGEMMA_MODEL:
|
| 51 |
+
processor = AutoProcessor.from_pretrained(model_name)
|
| 52 |
+
model = AutoModelForImageTextToText.from_pretrained(
|
| 53 |
+
model_name,
|
| 54 |
+
trust_remote_code=True,
|
| 55 |
+
torch_dtype=dtype
|
| 56 |
+
)
|
| 57 |
+
model.eval()
|
| 58 |
+
self.models[model_name] = model
|
| 59 |
+
self.processors[model_name] = processor
|
| 60 |
+
|
| 61 |
+
return self.models[model_name], self.processors[model_name]
|
| 62 |
+
|
| 63 |
+
manager = ModelManager()
|
| 64 |
+
|
| 65 |
+
def pdf_to_images(pdf_path):
|
| 66 |
+
doc = fitz.open(pdf_path)
|
| 67 |
+
images = []
|
| 68 |
+
for page in doc:
|
| 69 |
+
pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
|
| 70 |
+
img_data = pix.tobytes("png")
|
| 71 |
+
img = Image.open(io.BytesIO(img_data))
|
| 72 |
+
images.append(img)
|
| 73 |
+
doc.close()
|
| 74 |
+
return images
|
| 75 |
+
|
| 76 |
+
@spaces.GPU(duration=120)
|
| 77 |
+
def run_ocr(input_image, input_file, model_choice, custom_prompt):
|
| 78 |
+
images_to_process = []
|
| 79 |
+
|
| 80 |
+
if input_file is not None:
|
| 81 |
+
if input_file.name.lower().endswith(".pdf"):
|
| 82 |
+
try:
|
| 83 |
+
images_to_process = pdf_to_images(input_file.name)
|
| 84 |
+
except Exception as e:
|
| 85 |
+
return f"Помилка читання PDF: {str(e)}"
|
| 86 |
+
else:
|
| 87 |
+
try:
|
| 88 |
+
images_to_process = [Image.open(input_file.name)]
|
| 89 |
+
except Exception as e:
|
| 90 |
+
return f"Помилка завантаження файлу: {str(e)}"
|
| 91 |
+
elif input_image is not None:
|
| 92 |
+
images_to_process = [input_image]
|
| 93 |
+
else:
|
| 94 |
+
return "Будь ласка, завантажте зображення або PDF файл."
|
| 95 |
+
|
| 96 |
+
try:
|
| 97 |
+
model, processor_or_tokenizer = manager.get_model(model_choice)
|
| 98 |
+
# Move to GPU only inside the decorated function
|
| 99 |
+
print(f"Moving {model_choice} to GPU...")
|
| 100 |
+
model.to("cuda")
|
| 101 |
+
except Exception as e:
|
| 102 |
+
return f"Помилка завантаження чи переміщення моделі: {str(e)}\nЯкщо це MedGemma, переконайтеся, що ви надали HF_TOKEN."
|
| 103 |
+
|
| 104 |
+
output_dir = 'outputs'
|
| 105 |
+
os.makedirs(output_dir, exist_ok=True)
|
| 106 |
+
|
| 107 |
+
all_results = []
|
| 108 |
+
|
| 109 |
+
try:
|
| 110 |
+
for i, img in enumerate(images_to_process):
|
| 111 |
+
img = img.convert("RGB")
|
| 112 |
+
try:
|
| 113 |
+
print(f"Processing page/image {i+1} with {model_choice}...")
|
| 114 |
+
if model_choice == DEEPSEEK_MODEL:
|
| 115 |
+
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
|
| 116 |
+
img.save(tmp.name)
|
| 117 |
+
tmp_path = tmp.name
|
| 118 |
+
|
| 119 |
+
try:
|
| 120 |
+
with torch.no_grad():
|
| 121 |
+
res = model.infer(
|
| 122 |
+
processor_or_tokenizer,
|
| 123 |
+
prompt=custom_prompt if custom_prompt else "<image>\nFree OCR. ",
|
| 124 |
+
image_file=tmp_path,
|
| 125 |
+
output_path=output_dir,
|
| 126 |
+
base_size=1024,
|
| 127 |
+
image_size=768,
|
| 128 |
+
crop_mode=True,
|
| 129 |
+
eval_mode=True
|
| 130 |
+
)
|
| 131 |
+
all_results.append(f"--- Page/Image {i+1} ---\n{res}")
|
| 132 |
+
finally:
|
| 133 |
+
if os.path.exists(tmp_path):
|
| 134 |
+
os.remove(tmp_path)
|
| 135 |
+
|
| 136 |
+
elif model_choice == MEDGEMMA_MODEL:
|
| 137 |
+
prompt_text = custom_prompt if custom_prompt else "extract all text from image"
|
| 138 |
+
messages = [
|
| 139 |
+
{
|
| 140 |
+
"role": "user",
|
| 141 |
+
"content": [
|
| 142 |
+
{"type": "image", "image": img},
|
| 143 |
+
{"type": "text", "text": prompt_text}
|
| 144 |
+
]
|
| 145 |
+
}
|
| 146 |
+
]
|
| 147 |
+
|
| 148 |
+
inputs = processor_or_tokenizer.apply_chat_template(
|
| 149 |
+
messages,
|
| 150 |
+
add_generation_prompt=True,
|
| 151 |
+
tokenize=True,
|
| 152 |
+
return_dict=True,
|
| 153 |
+
return_tensors="pt"
|
| 154 |
+
).to("cuda") # Ensure inputs are on cuda
|
| 155 |
+
|
| 156 |
+
with torch.no_grad():
|
| 157 |
+
output = model.generate(**inputs, max_new_tokens=4096)
|
| 158 |
+
|
| 159 |
+
input_len = inputs["input_ids"].shape[-1]
|
| 160 |
+
res = processor_or_tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
|
| 161 |
+
all_results.append(f"--- Page/Image {i+1} ---\n{res}")
|
    except Exception as e:
        all_results.append(f"--- Page/Image {i+1} ---\nПомилка: {str(e)}")
    finally:
        # Move back to CPU and clean up to free ZeroGPU resources
        print(f"Moving {model_choice} back to CPU...")
        model.to("cpu")
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()

    return "\n\n".join(all_results)


def save_result_to_file(text):
    if not text or text.startswith("Будь ласка") or text.startswith("Помилка"):
        return None
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"ocr_result_{timestamp}.txt"
    os.makedirs("outputs", exist_ok=True)
    filepath = os.path.abspath(os.path.join("outputs", filename))
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(text)
    return filepath


custom_css = """
.header { text-align: center; margin-bottom: 30px; }
.header h1 { font-size: 2.5rem; }
.footer { text-align: center; margin-top: 50px; font-size: 0.9rem; color: #718096; }
"""

with gr.Blocks(title="OCR Comparison: DeepSeek vs MedGemma", css=custom_css) as demo:
    with gr.Column():
        gr.Markdown("# 🔍 OCR & Medical Document Analysis", elem_classes="header")
        gr.Markdown("Порівняння DeepSeek-OCR-2 та MedGemma-1.5-4B (HuggingFace ZeroGPU Edition)", elem_classes="header")

        with gr.Row():
            with gr.Column(scale=1):
                with gr.Tab("Зображення"):
                    input_img = gr.Image(type="pil", label="Перетягніть зображення")
                with gr.Tab("PDF / Файли"):
                    input_file = gr.File(label="Завантажте PDF або інший файл")

                model_selector = gr.Dropdown(
                    choices=[DEEPSEEK_MODEL, MEDGEMMA_MODEL],
                    value=DEEPSEEK_MODEL,
                    label="Оберіть модель"
                )

                with gr.Accordion("Налаштування", open=False):
                    prompt_input = gr.Textbox(
                        value="",
                        label="Користувацький промпт (залиште порожнім для дефолтного)",
                        placeholder="Наприклад: Extract all text from image"
                    )

                with gr.Row():
                    clear_btn = gr.Button("Очистити", variant="secondary")
                    ocr_btn = gr.Button("Запустити аналіз", variant="primary")

            with gr.Column(scale=1):
                output_text = gr.Textbox(
                    label="Результат",
                    lines=20
                )

                with gr.Row():
                    save_btn = gr.Button("Зберегти у файл 💾")
                    download_file = gr.File(label="Завантажити результат")

        gr.Markdown("---")
        gr.Markdown("### Як використовувати:\n1. Завантажте зображення або PDF.\n2. Виберіть модель.\n3. Натисніть 'Запустити аналіз'.\n*Примітка: MedGemma потребує HF_TOKEN з доступом до моделі.*")

    # Event handlers
    ocr_btn.click(
        fn=run_ocr,
        inputs=[input_img, input_file, model_selector, prompt_input],
        outputs=output_text
    )

    save_btn.click(
        fn=save_result_to_file,
        inputs=output_text,
        outputs=download_file
    )

    def clear_all():
        return None, None, ""

    clear_btn.click(
        fn=clear_all,
        inputs=None,
        outputs=[input_img, input_file, output_text]
    )

if __name__ == "__main__":
    demo.queue().launch()
compare_models.py
ADDED
@@ -0,0 +1,105 @@
import torch
from transformers import AutoModel, AutoTokenizer, AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import fitz
import os
import time

DEEPSEEK_MODEL = 'deepseek-ai/DeepSeek-OCR-2'
MEDGEMMA_MODEL = 'google/medgemma-1.5-4b-it'

if torch.backends.mps.is_available():
    print("Patching torch for MPS compatibility...")
    device = "mps"
    torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("mps")
    torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("mps")
    torch.bfloat16 = torch.float16
    dtype = torch.float16
else:
    device = "cpu"
    dtype = torch.float32

def get_page_image(pdf_path, page_num=0):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    doc.close()
    return img

def run_deepseek(img):
    tokenizer = AutoTokenizer.from_pretrained(DEEPSEEK_MODEL, trust_remote_code=True)
    model = AutoModel.from_pretrained(DEEPSEEK_MODEL, trust_remote_code=True, use_safetensors=True)
    model = model.to(device=device, dtype=dtype).eval()

    with torch.no_grad():
        # Need a temp file for DeepSeek's .infer
        img.save("temp_comp.png")
        res = model.infer(
            tokenizer,
            prompt="<image>\nFree OCR. ",
            image_file="temp_comp.png",
            output_path="outputs",
            base_size=1024,
            image_size=768,
            crop_mode=True,
            eval_mode=True
        )
    os.remove("temp_comp.png")
    return res

def run_medgemma(img):
    processor = AutoProcessor.from_pretrained(MEDGEMMA_MODEL)
    model = AutoModelForImageTextToText.from_pretrained(
        MEDGEMMA_MODEL,
        trust_remote_code=True,
        dtype=dtype if device == "mps" else torch.float32,
        device_map="auto" if device != "mps" else None
    ).eval()
    if device == "mps":
        model = model.to("mps")

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": "Extract all text from this medical document."}
            ]
        }
    ]
    inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=2048)

    input_len = inputs["input_ids"].shape[-1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

def compare():
    pdf_path = "doc_for_testing/pdf12_un.pdf"
    if not os.path.exists(pdf_path):
        print("PDF not found.")
        return

    img = get_page_image(pdf_path)

    print("\n--- Running DeepSeek-OCR-2 ---")
    start = time.time()
    ds_res = run_deepseek(img)
    print(f"Time: {time.time() - start:.2f}s")

    print("\n--- Running MedGemma-1.5-4B ---")
    start = time.time()
    mg_res = run_medgemma(img)
    print(f"Time: {time.time() - start:.2f}s")

    with open("model_comparison.md", "w") as f:
        f.write("# Comparison Report: DeepSeek-OCR-2 vs MedGemma-1.5-4B\n\n")
        f.write("## DeepSeek-OCR-2 Result\n\n")
        f.write(ds_res + "\n\n")
        f.write("## MedGemma-1.5-4B Result\n\n")
        f.write(mg_res + "\n")

if __name__ == "__main__":
    compare()
convert_docs.py
ADDED
@@ -0,0 +1,31 @@
import fitz  # PyMuPDF
import os
from PIL import Image

def convert_pdf_to_images(pdf_path, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    doc = fitz.open(pdf_path)
    base_name = os.path.basename(pdf_path).split('.')[0]
    image_paths = []

    # Just take the first page for testing to save time/memory
    for i in range(min(1, len(doc))):
        page = doc.load_page(i)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # Zoom for better OCR
        output_file = os.path.join(output_dir, f"{base_name}_page_{i+1}.png")
        # Convert to PIL Image
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        img.save(output_file)
        image_paths.append(output_file)
        print(f"Converted {pdf_path} page {i+1} to {output_file}")

    return image_paths

if __name__ == "__main__":
    pdf_dir = "doc_for_testing"
    output_dir = "doc_images"
    for filename in os.listdir(pdf_dir):
        if filename.endswith(".pdf"):
            convert_pdf_to_images(os.path.join(pdf_dir, filename), output_dir)
convert_full_pdf.py
ADDED
@@ -0,0 +1,32 @@
import fitz  # PyMuPDF
import os
from PIL import Image

def convert_full_pdf_to_images(pdf_path, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    doc = fitz.open(pdf_path)
    base_name = os.path.basename(pdf_path).split('.')[0]
    image_paths = []

    print(f"Converting all {len(doc)} pages of {pdf_path}...")
    for i in range(len(doc)):
        page = doc.load_page(i)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # Zoom for better OCR
        output_file = os.path.join(output_dir, f"{base_name}_page_{i+1}.png")
        # Convert to PIL Image
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        img.save(output_file)
        image_paths.append(output_file)
        print(f"Converted page {i+1}/{len(doc)}")

    return image_paths

if __name__ == "__main__":
    pdf_file = "doc_for_testing/pdf12_un.pdf"
    output_dir = "doc_images_full"
    if os.path.exists(pdf_file):
        convert_full_pdf_to_images(pdf_file, output_dir)
    else:
        print(f"File {pdf_file} not found.")
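Both converters render with `fitz.Matrix(2, 2)`, a 2x zoom in each axis. Since PDF pages are laid out at 72 points per inch, a 2x zoom renders at an effective 144 DPI. The resulting pixel dimensions can be sketched with plain arithmetic (approximate: PyMuPDF rounds the transformed rectangle, so real output can differ by a pixel):

```python
def rendered_size(width_pt, height_pt, zoom=2.0, base_dpi=72):
    """Approximate pixel dimensions of a page of the given size in
    points when rendered with fitz.Matrix(zoom, zoom)."""
    dpi = base_dpi * zoom
    return round(width_pt * zoom), round(height_pt * zoom), dpi

# A4 is 595 x 842 points; 2x zoom gives roughly 1190 x 1684 px at 144 DPI.
w, h, dpi = rendered_size(595, 842)
assert (w, h, dpi) == (1190, 1684, 144.0)
print(w, h, dpi)
```

144 DPI is a reasonable floor for OCR; rendering at the default 72 DPI loses small glyphs.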
generate_test_image.py
ADDED
@@ -0,0 +1,21 @@
from PIL import Image, ImageDraw, ImageFont
import os

def create_sample_image(text, filename):
    # Create a white image
    img = Image.new('RGB', (800, 400), color=(255, 255, 255))
    d = ImageDraw.Draw(img)

    # Try to load a default font
    try:
        # On macOS, this might work
        font = ImageFont.truetype("/System/Library/Fonts/Supplemental/Arial.ttf", 40)
    except OSError:
        font = ImageFont.load_default()

    d.text((100, 150), text, fill=(0, 0, 0), font=font)
    img.save(filename)
    print(f"Created {filename}")

if __name__ == "__main__":
    create_sample_image("DeepSeek-OCR-2 test. Hello World!", "sample_test.png")
ocr_full_pdf12.py
ADDED
@@ -0,0 +1,89 @@
from transformers import AutoModel, AutoTokenizer
import torch
import os
from PIL import Image
import time

# Force CPU for stability on Mac
device = "cpu"
print(f"Using device: {device}")

# Patch to avoid CUDA calls in custom code
torch.Tensor.cuda = lambda self, *args, **kwargs: self.to(device)
torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to(device)

model_name = 'deepseek-ai/DeepSeek-OCR-2'

def ocr_full_document():
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    print("Loading model...")
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_safetensors=True
    )
    model = model.eval()

    # Overwrite bfloat16 to float32 for CPU compatibility
    torch.bfloat16 = torch.float32

    image_dir = "doc_images_full"
    output_dir = "ocr_results_pdf12"
    os.makedirs(output_dir, exist_ok=True)

    # Get images sorted by page number
    import re
    def get_page_num(filename):
        match = re.search(r'page_(\d+)', filename)
        return int(match.group(1)) if match else 0

    images = sorted([f for f in os.listdir(image_dir) if f.endswith(".png")], key=get_page_num)

    full_markdown = []

    for i, img_name in enumerate(images):
        img_path = os.path.join(image_dir, img_name)
        print(f"\n[{i+1}/{len(images)}] Processing page {get_page_num(img_name)}...")

        prompt = "<image>\nFree OCR. "

        start_time = time.time()
        try:
            with torch.no_grad():
                res = model.infer(
                    tokenizer,
                    prompt=prompt,
                    image_file=img_path,
                    output_path=output_dir,
                    base_size=1024,
                    image_size=768,
                    crop_mode=False,
                    eval_mode=True
                )

            elapsed = time.time() - start_time
            print(f"  Done in {elapsed:.2f}s")

            # Save individual page result
            page_file = os.path.join(output_dir, f"{img_name}.md")
            with open(page_file, "w") as f:
                f.write(res)

            full_markdown.append(f"## Page {get_page_num(img_name)}\n\n{res}\n\n---\n")

        except Exception as e:
            print(f"  Failed: {e}")
            full_markdown.append(f"## Page {get_page_num(img_name)}\n\n[OCR FAILED]\n\n---\n")

    # Save combined result
    combined_file = os.path.join(output_dir, "full_document.md")
    with open(combined_file, "w") as f:
        f.write("# OCR Result for pdf12_un.pdf\n\n")
        f.write("".join(full_markdown))

    print(f"\nCompleted! Full result saved to: {combined_file}")

if __name__ == "__main__":
    ocr_full_document()
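The regex-based `get_page_num` sort in ocr_full_pdf12.py matters because plain lexicographic sorting interleaves multi-digit pages (`page_10` lands before `page_2`). The same idea in isolation:

```python
import re

def page_num(name):
    # Same idea as get_page_num in ocr_full_pdf12.py
    m = re.search(r"page_(\d+)", name)
    return int(m.group(1)) if m else 0

files = ["doc_page_10.png", "doc_page_2.png", "doc_page_1.png"]

# Lexicographic order interleaves 1, 10, 2 ...
assert sorted(files) == ["doc_page_1.png", "doc_page_10.png", "doc_page_2.png"]
# ... while the numeric key restores true page order.
assert sorted(files, key=page_num) == ["doc_page_1.png", "doc_page_2.png", "doc_page_10.png"]
print(sorted(files, key=page_num))
```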
requirements.txt
ADDED
@@ -0,0 +1,16 @@
torch
transformers>=4.45.0
tokenizers
einops
addict
easydict
accelerate
sentencepiece
pillow
matplotlib
requests
torchvision
gradio
pymupdf
spaces
huggingface-hub
test_inference.py
ADDED
@@ -0,0 +1,66 @@
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn as nn
import os
from PIL import Image, ImageOps
import math

# Force CPU
device = "cpu"
dtype = torch.float32
print(f"Forcing device: {device} with dtype: {dtype}")

# Patch torch types to avoid mixed precision errors in their custom code
torch.bfloat16 = torch.float32  # Force bfloat16 to float32
torch.Tensor.cuda = lambda self, *args, **kwargs: self.to("cpu")
torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to("cpu")

model_name = 'deepseek-ai/DeepSeek-OCR-2'

def test_inference():
    print(f"Loading tokenizer for {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    print(f"Loading model for {model_name}...")
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_safetensors=True,
        torch_dtype=torch.float32  # Explicitly float32
    )

    model = model.eval()  # Already on CPU by default if no device_map

    output_dir = 'outputs'
    os.makedirs(output_dir, exist_ok=True)

    prompt = "<image>\nFree OCR. "
    image_file = 'sample_test.png'

    if not os.path.exists(image_file):
        print(f"Error: {image_file} not found.")
        return

    print("Running inference on CPU...")
    try:
        with torch.no_grad():
            res = model.infer(
                tokenizer,
                prompt=prompt,
                image_file=image_file,
                output_path=output_dir,
                base_size=512,
                image_size=384,
                crop_mode=False,
                eval_mode=True
            )
        print("\n--- OCR Result ---")
        print(res)
        print("------------------")
    except Exception as e:
        print(f"Inference failed: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    test_inference()
test_medgemma.py
ADDED
@@ -0,0 +1,64 @@
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from PIL import Image
import os

model_id = "google/medgemma-1.5-4b-it"

def test_medgemma():
    print(f"Loading {model_id}...")
    try:
        processor = AutoProcessor.from_pretrained(model_id)
        # We try to load without device_map="auto" for MPS or manual device control
        model = AutoModelForImageTextToText.from_pretrained(
            model_id,
            torch_dtype=torch.float32,  # CPU usually stable with float32
            trust_remote_code=True
        ).eval()

        print("Model loaded.")

        image_path = "sample_test.png"
        if not os.path.exists(image_path):
            print("No test image found.")
            return

        image = Image.open(image_path).convert("RGB")

        # Use chat template as suggested
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": "Extract all text from this image."}
                ]
            }
        ]

        inputs = processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        )

        print("Running inference...")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=100)

        input_len = inputs["input_ids"].shape[-1]
        result = processor.decode(output[0][input_len:], skip_special_tokens=True)

        print("\n--- MedGemma Result ---")
        print(result)
        print("-----------------------")

    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    test_medgemma()
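The `output[0][input_len:]` slice used after `model.generate` exists because `generate` returns the prompt tokens followed by the continuation in one sequence; decoding everything would echo the prompt back to the user. The slicing logic in isolation, with plain lists standing in for token tensors:

```python
# generate() returns prompt tokens followed by new tokens in one
# sequence; only the tail past input_len is the model's answer.
prompt_ids = [101, 2054, 2003]        # stand-in for inputs["input_ids"][0]
generated = prompt_ids + [7592, 999]  # stand-in for output[0]

input_len = len(prompt_ids)           # inputs["input_ids"].shape[-1]
new_tokens = generated[input_len:]
assert new_tokens == [7592, 999]
print(new_tokens)
```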
test_minimal.py
ADDED
@@ -0,0 +1,31 @@
from transformers import AutoModel, AutoTokenizer
import torch
import os
from PIL import Image

model_name = 'deepseek-ai/DeepSeek-OCR-2'

def test_inference():
    print(f"Loading tokenizer for {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    print(f"Loading model for {model_name}...")
    # Load model on CPU
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_safetensors=True
    )

    # Check if loaded
    print("Model loaded successfully.")
    print(f"Model type: {type(model)}")

    # Test simple tokenization
    inputs = tokenizer("Hello", return_tensors="pt")
    print("Tokenizer test: Success")

    print("DeepSeek-OCR-2 is ready for use.")

if __name__ == "__main__":
    test_inference()
test_real_docs.py
ADDED
@@ -0,0 +1,82 @@
from transformers import AutoModel, AutoTokenizer
import torch
import os
from PIL import Image
import time

# Force CPU for stability
device = "cpu"
print(f"Using device: {device}")

# Patch to avoid CUDA calls in custom code
torch.Tensor.cuda = lambda self, *args, **kwargs: self.to(device)
torch.nn.Module.cuda = lambda self, *args, **kwargs: self.to(device)

model_name = 'deepseek-ai/DeepSeek-OCR-2'

def test_docs():
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    print("Loading model (may take a minute)...")
    # Load with default parameters that worked in test_minimal.py
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,
        use_safetensors=True
    )
    model = model.eval()

    # After loading, we monkeypatch bfloat16 for the inference logic
    torch.bfloat16 = torch.float32

    image_dir = "doc_images"
    output_dir = "ocr_results"
    os.makedirs(output_dir, exist_ok=True)

    images = sorted([f for f in os.listdir(image_dir) if f.endswith(".png")])

    for img_name in images:
        img_path = os.path.join(image_dir, img_name)
        print(f"\n--- Processing: {img_name} ---")

        # DeepSeek-OCR-2 needs specific ratios for its hardcoded query embeddings
        # base_size=1024 -> n_query=256 (supported)
        # image_size=768 -> n_query=144 (supported)

        prompt = "<image>\nFree OCR. "

        start_time = time.time()
        try:
            with torch.no_grad():
                res = model.infer(
                    tokenizer,
                    prompt=prompt,
                    image_file=img_path,
                    output_path=output_dir,
                    base_size=1024,  # Must be 1024 for 256 queries
                    image_size=768,  # Must be 768 for 144 queries
                    crop_mode=False,
                    eval_mode=True
                )

            elapsed = time.time() - start_time
            print(f"Done in {elapsed:.2f}s")

            result_file = os.path.join(output_dir, f"{img_name}.md")
            with open(result_file, "w") as f:
                f.write(res)

            print(f"Result saved to {result_file}")
            print("Preview (first 500 chars):")
            print("-" * 20)
            print(res[:500] + "...")
            print("-" * 20)

        except Exception as e:
            print(f"Inference failed for {img_name}: {e}")
            import traceback
            traceback.print_exc()

if __name__ == "__main__":
    test_docs()
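The comments in test_real_docs.py pair base_size=1024 with 256 queries and image_size=768 with 144 queries. Those figures are consistent with one query per 64-pixel patch, i.e. n_query = (size / 64)^2; note this stride is inferred from the two numbers, not taken from DeepSeek documentation:

```python
def n_query(size, patch=64):
    """Query count implied by a square input of `size` pixels at one
    query per 64x64 patch (the stride is inferred, not documented)."""
    side = size // patch
    return side * side

assert n_query(1024) == 256  # 16 x 16 patches
assert n_query(768) == 144   # 12 x 12 patches
print(n_query(1024), n_query(768))
```

Under that reading, any size divisible by 64 yields a square patch grid, which is why only specific base_size/image_size pairs line up with the model's hardcoded query embeddings.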