
Dr. Abdulmalek

deploy: OmniFile AI Processor v4.3.0

900df0b 15 days ago


دليل المطور - OmniFile AI Processor v2.1

Developer Guide - OmniFile AI Processor v2.1

Comprehensive developer guide for building and extending the system

المحتويات / Table of Contents

هيكل المشروع / Project Structure
البنية المعمارية / Architecture
إضافة ميزة جديدة / Adding a New Feature
إضافة لغة جديدة / Adding a New Language
إضافة محرك OCR جديد / Adding a New OCR Engine
الاختبارات / Testing
النشر / Deployment
معايير الكود / Code Standards
GitHub Actions CI/CD

1. هيكل المشروع / Project Structure

OmniFile_Processor/
|
|-- app.py                          # واجهة Streamlit الرئيسية / Main Streamlit UI
|-- main.py                         # نقطة الدخول / Entry point (CLI arguments)
|-- config.py                       # الإعدادات المركزية / Central config (OmniFileConfig dataclass)
|-- database.py                     # قاعدة بيانات SQLite / SQLite database (OmniFileDB)
|-- tasks.py                        # مهام Celery / Celery async tasks
|-- requirements.txt                # التبعيات / Dependencies
|-- Dockerfile                      # Docker للنشر / Docker for deployment
|-- LICENSE                         # الترخيص / License
|-- __init__.py                     # إصدار المشروع / Project version
|
|-- modules/                        # الوحدات الفرعية / Sub-modules
|   |-- __init__.py
|   |
|   |-- nlp/                        # معالجة اللغة الطبيعية / NLP
|   |   |-- __init__.py
|   |   |-- spell_corrector.py      # المصحح الإملائي / Spell corrector (EN/AR/DE)
|   |   |-- translator.py           # المترجم التقني / Technical translator
|   |   |-- summarizer.py           # ملخص النصوص / Text summarizer (BART)
|   |   |-- entity_extractor.py     # استخراج الكيانات / NER entity extraction
|   |   |-- text_classifier.py      # تصنيف النصوص / Text classification
|   |   |-- language_detector.py    # كشف اللغة / Language detection
|   |   |-- correction_dict.json    # قاموس التصحيحات المُتعلمة / Learned corrections
|   |
|   |-- vision/                     # الرؤية الحاسوبية / Computer Vision
|   |   |-- __init__.py
|   |   |-- ocr_engine.py           # محرك OCR المتكامل / Integrated OCR engine
|   |   |-- pdf_processor.py        # معالج PDF / PDF processor
|   |   |-- image_preprocessor.py   # معالجة الصور المسبقة / Image preprocessing
|   |   |-- text_reconstructor.py   # إعادة تجميع النصوص / Text reconstruction
|   |
|   |-- security/                   # الأمان والحماية / Security
|       |-- __init__.py
|       |-- sensitive_data_scanner.py # فحص البيانات الحساسة / Sensitive data scanner
|       |-- file_scanner.py         # فحص الملفات / File scanner
|       |-- file_organizer.py       # تنظيم الملفات / File organizer
|       |-- backup_manager.py       # النسخ الاحتياطي / Backup manager
|       |-- archive_handler.py      # معالجة الأرشيفات / Archive handler
|       |-- code_protector.py       # حماية الكود / Code protection
|
|-- src/                            # محرك HandwrittenOCR المتقدم / Advanced HandwrittenOCR
|   |-- __init__.py
|   |-- main.py                     # نقطة دخول المحرك / Engine entry point
|   |-- gradio_ui.py                # واجهة Gradio المتقدمة / Advanced Gradio UI
|   |-- recognition.py              # التعرف المتقدم / Advanced recognition
|   |-- preprocessing.py            # المعالجة المسبقة المتقدمة / Advanced preprocessing
|   |-- reconstruction.py           # إعادة البناء المتقدمة / Advanced reconstruction
|   |-- correction.py               # التصحيح المتقدم / Advanced correction
|   |-- export.py                   # التصدير / Export
|   |-- pdf_processor.py            # معالج PDF متقدم / Advanced PDF processor
|   |-- review_ui.py                # واجهة المراجعة / Review UI
|   |-- study_guide.py              # دليل الدراسة / Study guide
|   |-- finetuning.py               # التدريب الدقيق / Fine-tuning
|   |-- metrics.py                  # مقاييس الأداء / Performance metrics
|   |-- database.py                 # قاعدة بيانات المحرك / Engine database
|   |-- migration.py                # ترحيل البيانات / Data migration
|   |-- sync.py                     # المزامنة / Synchronization
|   |-- logger.py                   # نظام التسجيل / Logging system
|
|-- tests/                          # الاختبارات / Tests
|   |-- __init__.py
|   |-- conftest.py                 # Fixtures المشتركة / Shared fixtures
|   |-- test_spell_corrector.py     # اختبارات المصحح / Spell corrector tests
|   |-- test_summarizer.py          # اختبارات التلخيص / Summarizer tests
|   |-- test_sensitive_scanner.py   # اختبارات فحص البيانات / Sensitive data tests
|
|-- notebooks/                      # دفاتر Jupyter / Jupyter notebooks
|   |-- OmniFile_Complete.ipynb     # دفتر شامل / Complete notebook
|   |-- HandwrittenOCR_Ultimate.ipynb
|   |-- HandwrittenOCR_Colab.ipynb
|
|-- data_seed/                      # بيانات أولية / Seed data
|   |-- correction_dict_seed.json   # بذرة قاموس التصحيح / Corrections seed
|
|-- artifacts/                      # ملفات مُنتجة / Generated artifacts
|   |-- correction_dict.json
|
|-- database/                       # ملفات قاعدة البيانات / Database files
|-- data/                           # بيانات المستخدم / User data
|   |-- raw/                        # ملفات خام / Raw files
|   |   |-- pdfs/
|   |   |-- images/
|   |   |-- archives/
|   |-- processed/                  # ملفات معالجة / Processed files
|   |-- exports/                    # ملفات مُصدّرة / Exported files
|-- models_cache/                   # تخزين النماذج / Model cache
|-- backups/                        # نسخ احتياطية / Backups
|-- logs/                           # سجلات / Logs
|-- docs/                           # التوثيق / Documentation

2. البنية المعمارية / Architecture

2.1 نظرة عامة / Overview

The system is built on a modular architecture with lazy loading and graceful degradation:

┌─────────────────────────────────────────────────────────────┐
│                    واجهة المستخدم / UI                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Streamlit   │  │   Gradio     │  │      CLI         │   │
│  │  (app.py)    │  │ (gradio_ui)  │  │  (main.py)       │   │
│  └──────┬───────┘  └──────┬───────┘  └────────┬─────────┘   │
└─────────┼─────────────────┼───────────────────┼─────────────┘
          │                 │                   │
          v                 v                   v
┌─────────────────────────────────────────────────────────────┐
│                    الإعدادات / Config                        │
│              OmniFileConfig (config.py)                      │
│     إعدادات OCR | NLP | الأمان | النشر | اللغات             │
└──────────────────────────┬──────────────────────────────────┘
                           │
          ┌────────────────┼────────────────┐
          v                v                v
┌─────────────┐   ┌──────────────┐  ┌──────────────┐
│ modules/    │   │ modules/     │  │ modules/     │
│  vision/    │   │  nlp/        │  │  security/   │
│             │   │              │  │              │
│ OCR Engine  │   │ SpellCheck   │  │ SensitiveScan│
│ PDF Process │   │ Translator   │  │ FileOrganize │
│ ImgProcess  │   │ Summarizer   │  │ BackupMgr    │
│ TextRecon   │   │ NER          │  │ CodeProtect  │
│             │   │ Classify     │  │ FileScan     │
│             │   │ LangDetect   │  │ ArchiveHand  │
└──────┬──────┘   └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       v                 v                 v
┌─────────────────────────────────────────────────────────────┐
│                    قاعدة البيانات / Database                  │
│                 OmniFileDB (database.py)                      │
│     documents | ocr_results | translations | entities        │
│     corrections_log | processing_history                     │
└─────────────────────────────────────────────────────────────┘
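The lazy-loading and graceful-degradation ideas named in the overview can be sketched in a few lines. This is a minimal illustration, not the project's code: `LazyComponent`, its `loader` argument, and the `fallback` value are hypothetical names. The loader runs only on first use, and a failed load yields the fallback instead of raising.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


class LazyComponent:
    """Hypothetical wrapper: loads a heavy resource on first use only."""

    def __init__(self, loader: Callable[[], Any], fallback: Any = None) -> None:
        self._loader = loader        # function that builds the heavy resource
        self._fallback = fallback    # value returned when loading fails
        self._resource: Optional[Any] = None
        self._failed = False

    def get(self) -> Any:
        """Return the resource, loading it once; use the fallback on error."""
        if self._resource is None and not self._failed:
            try:
                self._resource = self._loader()
            except Exception as exc:  # e.g. ImportError for a missing optional library
                logger.warning("Component unavailable, using fallback: %s", exc)
                self._failed = True
        return self._fallback if self._failed else self._resource


def _load_missing_library() -> Any:
    # Simulates an optional dependency that is not installed.
    raise ImportError("optional library not installed")


calls = []  # records how many times the loader actually ran
ok = LazyComponent(lambda: calls.append("loaded") or "model")
broken = LazyComponent(_load_missing_library, fallback="unavailable")
```

Calling `ok.get()` twice runs the loader once, while `broken.get()` returns the fallback and logs a warning. The `QAEngine` example later in this guide follows the same shape with `_load_pipeline()`.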

2.2 الوحدات الأساسية / Core Modules

`config.py` - الإعدادات المركزية / Central Configuration

The OmniFileConfig class is a central dataclass containing all system settings:

from config import OmniFileConfig

# إنشاء إعدادات لبيئة مختلفة / Create config for different environments
cfg_local = OmniFileConfig.from_local(project_root="~/OmniFile_AI")
cfg_colab = OmniFileConfig.from_colab_drive()
cfg_custom = OmniFileConfig(
    use_gpu=True,
    enable_trocr=True,
    trocr_model_variant="small",
)

# Pick the config to work with; the calls below use the custom one
cfg = cfg_custom

# حفظ وتحميل الإعدادات / Save and load config
cfg.save("config/settings.json")
cfg_loaded = OmniFileConfig.load("config/settings.json")

# إعداد البيئة / Setup environment
cfg.setup_environment()  # creates the folders and sets the variables

# خصائص مفيدة / Useful properties
cfg.db_path           # database path
cfg.data_raw_dir      # raw files folder
cfg.models_cache_dir  # model cache folder
cfg.is_colab          # Colab environment detection
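To make the save/load round trip concrete, here is a minimal sketch of a dataclass persisted as JSON. `DemoConfig` is a hypothetical stand-in that borrows three field names from the example above. How `OmniFileConfig` actually serializes itself is not shown in this guide, so the JSON layout here is an assumption.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoConfig:
    """Hypothetical stand-in for OmniFileConfig with three documented fields."""

    use_gpu: bool = False
    enable_trocr: bool = True
    trocr_model_variant: str = "small"

    def save(self, path: str) -> None:
        """Write all fields to a JSON file, creating parent folders as needed."""
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path: str) -> "DemoConfig":
        """Rebuild a config from a JSON file written by save()."""
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        return cls(**data)


# Round trip through a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    settings_path = str(Path(tmp) / "config" / "settings.json")
    original = DemoConfig(use_gpu=True)
    original.save(settings_path)
    restored = DemoConfig.load(settings_path)
```

Because dataclasses compare field by field, `restored == original` holds after the round trip.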

`database.py` - قاعدة البيانات / Database

The OmniFileDB class manages an SQLite database in WAL mode:

from database import OmniFileDB

db = OmniFileDB("omnifile_data.db")
db.create_tables()

# إدراج مستند / Insert document
doc_id = db.insert_document({
    "file_name": "test.pdf",
    "file_type": "pdf",
    "raw_text": "Hello World",
    "language": "en",
})

# تحديث مستند / Update document
db.update_document(doc_id, {"corrected_text": "Hello World!", "is_reviewed": True})

# جلب مستند / Get document
doc = db.get_document(doc_id)

# بحث / Search
results = db.search_documents("hello")

# إحصائيات / Statistics
stats = db.get_stats()

# تصدير / Export
db.export_to_json("backup.json")

الجداول / Tables:

الجدول / Table	الوصف / Description
`documents`	Core documents (file_name, raw_text, corrected_text, category, ...)
`ocr_results`	Detailed OCR results (word_text, confidence, model_source, x, y, w, h)
`translations`	Translations (source_text, translated_text, source_lang, target_lang)
`entities`	Named entities (entity_text, entity_type, confidence)
`corrections_log`	Corrections log (original_text, corrected_text, auto_or_manual)
`processing_history`	Processing history (action, target, status, duration_sec)
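For readers unfamiliar with WAL mode, the sketch below shows how an SQLite file can be switched to write-ahead logging with the standard `sqlite3` module. It is an illustration under stated assumptions: `open_wal_database` is a hypothetical helper, and the `documents` schema is a reduced guess based on the column names listed in the table, not the schema `OmniFileDB.create_tables()` actually creates.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_wal_database(db_path: str) -> sqlite3.Connection:
    """Open an SQLite file, switch it to WAL mode, and create a demo table."""
    conn = sqlite3.connect(db_path)
    # WAL lets readers keep working while a writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            file_name TEXT NOT NULL,
            raw_text TEXT,
            corrected_text TEXT
        )
        """
    )
    conn.commit()
    return conn


# WAL needs a real file; an in-memory database reports "memory" instead.
tmp_dir = tempfile.mkdtemp()
conn = open_wal_database(str(Path(tmp_dir) / "demo.db"))
cursor = conn.execute(
    "INSERT INTO documents (file_name, raw_text) VALUES (?, ?)",
    ("test.pdf", "Hello World"),
)
conn.commit()
doc_id = cursor.lastrowid
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
row = conn.execute(
    "SELECT file_name, raw_text FROM documents WHERE id = ?", (doc_id,)
).fetchone()
conn.close()
```

The `?` placeholders keep user-supplied values out of the SQL string, which matters for fields such as `file_name` that come from uploads.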

`app.py` - واجهة Streamlit / Streamlit UI

The main interface is built with Streamlit tabs:

# app.py يستخدم OmniFileConfig و OmniFileDB
# app.py uses OmniFileConfig and OmniFileDB

# الهيكل العام / General structure:
# - شريط جانبي للإعدادات / Sidebar for settings
# - تبويبات للميزات / Tabs for features
# - رفع الملفات / File upload
# - عرض النتائج / Results display

2.3 الوحدات الفرعية / Sub-modules

`modules/nlp/` - معالجة اللغة الطبيعية / NLP

الملف / File	الفئة / Class	الوصف / Description
`spell_corrector.py`	`SpellCorrector`	Multilingual spell correction (EN/AR/DE) with protected terms
`translator.py`	`TechnicalTranslator`	Technical translation with code protection and a built-in dictionary
`summarizer.py`	`TextSummarizer`	BART summarization with automatic language detection and caching
`entity_extractor.py`	-	Named entity recognition (NER) with AraBERT
`text_classifier.py`	`TextClassifier`	Document classification with AraBERTv2
`language_detector.py`	`LanguageDetector`	Automatic text language detection

`modules/vision/` - الرؤية الحاسوبية / Computer Vision

الملف / File	الفئة / Class	الوصف / Description
`ocr_engine.py`	`OCREngine`	Integrated engine (TrOCR + EasyOCR + Tesseract) with caching and ONNX
`pdf_processor.py`	`PDFProcessor`	PDF processing (page-to-image conversion, text extraction)
`image_preprocessor.py`	`ImagePreprocessor`	Preprocessing (CLAHE, denoise, deskew, binarize)
`text_reconstructor.py`	`TextReconstructor`	Text reassembly from word bounding boxes
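Reassembling text from word boxes can be sketched as grouping boxes into lines by vertical position and ordering each line by x. The code below is a simplified illustration, not the `TextReconstructor` implementation: `WordBox`, `reconstruct_text`, and the line-grouping rule are assumptions, with field names borrowed from the `ocr_results` columns.

```python
from typing import TypedDict


class WordBox(TypedDict):
    """Assumed word record; field names mirror the ocr_results columns."""

    word_text: str
    x: int
    y: int
    w: int
    h: int


def reconstruct_text(words: list[WordBox], rtl: bool = False) -> str:
    """Group word boxes into lines, then order each line left-to-right (or RTL)."""
    lines: list[list[WordBox]] = []
    # Visit words top to bottom by their vertical centre.
    for word in sorted(words, key=lambda b: b["y"] + b["h"] / 2):
        center = word["y"] + word["h"] / 2
        if lines:
            first = lines[-1][0]
            # Same line if the centre falls inside the line's first box.
            if first["y"] <= center <= first["y"] + first["h"]:
                lines[-1].append(word)
                continue
        lines.append([word])
    return "\n".join(
        " ".join(
            b["word_text"] for b in sorted(line, key=lambda b: b["x"], reverse=rtl)
        )
        for line in lines
    )


boxes: list[WordBox] = [
    {"word_text": "World", "x": 60, "y": 12, "w": 50, "h": 20},
    {"word_text": "Hello", "x": 5, "y": 10, "w": 50, "h": 20},
    {"word_text": "Again", "x": 5, "y": 40, "w": 50, "h": 20},
]
text = reconstruct_text(boxes)
text_rtl = reconstruct_text(boxes, rtl=True)
```

The `rtl` flag reverses the in-line ordering, which is the minimum a right-to-left script such as Arabic needs. Skewed scans or multi-column layouts would need a more robust grouping rule.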

`modules/security/` - الأمان والحماية / Security

الملف / File	الفئة / Class	الوصف / Description
`sensitive_data_scanner.py`	`SensitiveDataScanner`	Sensitive data detection (Presidio + regex) and masking
`file_scanner.py`	-	Security scan of files (extensions, patterns)
`file_organizer.py`	`FileOrganizer`	Automatic sorting by type
`backup_manager.py`	`BackupManager`	Backup and restore
`archive_handler.py`	-	Archive handling (zip, tar.gz)
`code_protector.py`	-	Protects code snippets from processing
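The regex half of sensitive-data scanning can be illustrated with the standard `re` module. This sketch covers only that half (Presidio is not used here), and the pattern set, `scan`, and `mask` are hypothetical names rather than the `SensitiveDataScanner` API. The two patterns are deliberately simple and would miss many real-world formats.

```python
import re

# Illustrative patterns only; a production scanner needs broader rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def scan(text: str) -> list[dict]:
    """Return every match with its type and character offsets, in text order."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                {
                    "type": label,
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                }
            )
    return sorted(findings, key=lambda f: f["start"])


def mask(text: str) -> str:
    """Replace every match with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact ali@example.com, card 4111 1111 1111 1111."
findings = scan(sample)
masked = mask(sample)
```

Returning offsets from `scan` lets a UI highlight findings in place, while `mask` produces a copy that is safe to store or display.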

`src/` - محرك HandwrittenOCR المتقدم / Advanced HandwrittenOCR Engine

Advanced handwriting engine with Gradio interface:

الملف / File	الوصف / Description
`main.py`	Engine entry point
`gradio_ui.py`	Advanced Gradio UI
`recognition.py`	Advanced batched recognition
`preprocessing.py`	Advanced preprocessing
`reconstruction.py`	Text reconstruction
`correction.py`	Advanced correction
`finetuning.py`	Fine-tuning with LoRA
`metrics.py`	CER/WER computation
`study_guide.py`	Study guide generation
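CER and WER are both edit-distance ratios: the number of character (or word) edits needed to turn the OCR output into the reference, divided by the reference length. The sketch below shows the standard definition with a plain Levenshtein distance. The function names are illustrative and need not match what `metrics.py` exposes.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

For example, `cer("hello", "hallo")` is one substitution over five characters, and `wer("the quick brown fox", "the quick brown dog")` is one wrong word out of four. Both rates can exceed 1.0 when the hypothesis is much longer than the reference.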

3. إضافة ميزة جديدة / Adding a New Feature

مثال عملي: إضافة ميزة Question Answering / Practical Example: Adding QA Feature

الخطوة 1: إنشاء ملف الوحدة / Step 1: Create the Module File

# modules/nlp/qa_engine.py
"""
Question Answering Engine v1.0
==========================================================
Answers questions based on a text context.

Author: Your Name
Version: 1.0.0
"""

import logging
from typing import Optional

logger = logging.getLogger(__name__)


class QAEngine:
    """
    Question answering engine: answers questions based on a text context.

    Features:
    - Lazy loading
    - Automatic GPU detection
    - Arabic and English support
    - Graceful degradation on failure
    """

    # Supported models by language
    MODELS_BY_LANG = {
        "en": "deepset/roberta-base-squad2",
        "ar": "deepset/roberta-base-squad2",  # placeholder; can be replaced with an Arabic model
    }

    def __init__(
        self,
        model_name: Optional[str] = None,
        device: Optional[str] = None,
        max_answer_length: int = 50,
    ) -> None:
        """
        Initialize the QA engine.

        Args:
            model_name: Model name (if None, chosen by language)
            device: Device ('cuda', 'cpu', or None for automatic)
            max_answer_length: Maximum answer length
        """
        self.model_name = model_name
        self._device = device or self._detect_device()
        self.max_answer_length = max_answer_length

        # The model is loaded lazily
        self._pipeline = None
        self._loaded_model_name = None

        # Check library availability
        self._has_transformers = self._check_library("transformers")
        self._has_torch = self._check_library("torch")

    @staticmethod
    def _detect_device() -> str:
        """Detect the best available device."""
        try:
            import torch
            if torch.cuda.is_available():
                return "cuda"
        except Exception:  # missing torch or a broken CUDA setup: fall back to CPU
            pass
        return "cpu"

    @staticmethod
    def _check_library(import_name: str) -> bool:
        """Check whether a library can be imported."""
        try:
            __import__(import_name)
            return True
        except ImportError:
            return False

    def _load_pipeline(self, model_name: str) -> bool:
        """تحميل نموذج QA (يتم مرة واحدة)."""
        if self._loaded_model_name == model_name and self._pipeline is not None:
            return True

        if not (self._has_transformers and self._has_torch):
            logger.warning(
                "مكتبات transformers/torch غير مثبتة. "
                "pip install transformers torch"
            )
            return False

        try:
            from transformers import pipeline

            logger.info("جارٍ تحميل نموذج QA: %s على %s...", model_name, self._device)
            self._pipeline = pipeline(
                "question-answering",
                model=model_name,
                device=self._device,
            )
            self._loaded_model_name = model_name
            logger.info("تم تحميل نموذج QA بنجاح")
            return True

        except Exception as e:
            logger.error("فشل تحميل نموذج QA '%s': %s", model_name, e)
            return False

    def answer(
        self,
        question: str,
        context: str,
        language: Optional[str] = None,
    ) -> dict:
        """
        Answer a question based on a context.

        Args:
            question: The question
            context: The reference text
            language: Language of the text (if None, "en" is assumed)

        Returns:
            A dict containing:
                - answer: the answer
                - score: confidence level
                - start: answer start offset in the context
                - end: answer end offset in the context
                - model: the model used
                - error: present only when answering failed
        """
        if not question or not context:
            return {
                "answer": "",
                "score": 0.0,
                "start": 0,
                "end": 0,
                "model": "none",
                "error": "question_or_context_empty",
            }

        # اختيار النموذج
        lang = language or "en"
        model_name = self.model_name or self.MODELS_BY_LANG.get(lang, self.MODELS_BY_LANG["en"])

        # تحميل النموذج
        if not self._load_pipeline(model_name):
            return {
                "answer": "النموذج غير متاح",
                "score": 0.0,
                "start": 0,
                "end": 0,
                "model": "none",
                "error": "model_not_loaded",
            }

        # الإجابة
        try:
            result = self._pipeline(
                question=question,
                context=context,
                max_answer_length=self.max_answer_length,
            )
            return {
                "answer": result.get("answer", ""),
                "score": result.get("score", 0.0),
                "start": result.get("start", 0),
                "end": result.get("end", 0),
                "model": self._loaded_model_name,
            }
        except Exception as e:
            logger.error("فشل الإجابة: %s", e)
            return {
                "answer": "",
                "score": 0.0,
                "start": 0,
                "end": 0,
                "model": self._loaded_model_name,
                "error": str(e),
            }

    def is_available(self) -> bool:
        """هل المحرك متاح؟"""
        return self._has_transformers and self._has_torch

الخطوة 2: التسجيل في `__init__.py` / Step 2: Register in `__init__.py`

# modules/nlp/__init__.py
from .qa_engine import QAEngine

__all__ = [
    "QAEngine",
    # ... existing exports
]

الخطوة 3: إضافة التبويب في `app.py` / Step 3: Add Tab in `app.py`

# app.py
import streamlit as st
from modules.nlp.qa_engine import QAEngine


def render_qa_tab(cfg, db):
    """تبويب الأسئلة والأجوبة."""
    st.header("❓ الأسئلة والأجوبة / Question Answering")

    # Initialize engine
    if "qa_engine" not in st.session_state:
        st.session_state.qa_engine = QAEngine(
            device="cuda" if cfg.use_gpu else "cpu"
        )

    qa = st.session_state.qa_engine

    # Input
    question = st.text_input("السؤال / Question", placeholder="e.g., What is machine learning?")
    context = st.text_area(
        "السياق / Context",
        placeholder="Paste the reference text here...",
        height=200,
    )

    if st.button("إجابة / Answer") and question and context:
        with st.spinner("جارٍ البحث عن الإجابة..."):
            result = qa.answer(question, context)

        if result.get("answer"):
            st.success(f"**الإجابة / Answer:** {result['answer']}")
            st.info(f"الثقة / Confidence: {result['score']:.1%}")
        else:
            st.error(f"لم يتم العثور على إجابة: {result.get('error', '')}")

الخطوة 4: تحديث `requirements.txt` / Step 4: Update Requirements

# NLP - Question Answering
datasets>=2.15.0
# (transformers and torch are already listed)

الخطوة 5: كتابة الاختبارات / Step 5: Write Tests

# tests/test_qa_engine.py
"""
اختبارات محرك الأسئلة والأجوبة / QA Engine Tests
"""
import pytest
from modules.nlp.qa_engine import QAEngine


class TestQAEngine:
    """اختبارات فئة QAEngine."""

    def test_init(self):
        """اختبار التهيئة."""
        qa = QAEngine()
        assert qa._device in ("cpu", "cuda")
        assert qa._pipeline is None  # لا يُحمّل تلقائياً

    def test_init_with_model(self):
        """اختبار التهيئة بنموذج محدد."""
        qa = QAEngine(model_name="deepset/roberta-base-squad2")
        assert qa.model_name == "deepset/roberta-base-squad2"

    def test_answer_empty_question(self):
        """اختبار: سؤال فارغ."""
        qa = QAEngine()
        result = qa.answer("", "Some context")
        assert result["answer"] == ""
        assert "error" in result

    def test_answer_empty_context(self):
        """اختبار: سياق فارغ."""
        qa = QAEngine()
        result = qa.answer("What?", "")
        assert result["answer"] == ""
        assert "error" in result

    def test_answer_returns_expected_keys(self):
        """اختبار: مفاتيح النتيجة."""
        qa = QAEngine()
        result = qa.answer("Q", "C")
        expected_keys = {"answer", "score", "start", "end", "model"}
        assert expected_keys.issubset(result.keys())

    def test_is_available(self):
        """اختبار: توفر المحرك."""
        qa = QAEngine()
        # يجب أن يعيد True إذا كانت المكتبات مثبتة
        result = qa.is_available()
        assert isinstance(result, bool)

الخطوة 6: إضافة الإعداد في `config.py` / Step 6: Add Config Setting

# config.py - inside the OmniFileConfig dataclass
# QA Engine
enable_qa: bool = True
qa_model_name: str = "deepset/roberta-base-squad2"
qa_max_answer_length: int = 50

4. إضافة لغة جديدة / Adding a New Language

مثال: إضافة اللغة الفرنسية (FR) / Example: Adding French (FR)

الخطوة 1: تحديث `config.py` / Step 1: Update Config

# config.py
from dataclasses import dataclass, field


@dataclass
class OmniFileConfig:
    # ...
    supported_languages: list[str] = field(
        default_factory=lambda: ["en", "ar", "de", "fr"]  # add "fr"
    )

الخطوة 2: إضافة مصحح في `spell_corrector.py` / Step 2: Add Corrector

# modules/nlp/spell_corrector.py

class SpellCorrector:
    def __init__(self, ...):
        # ...
        self.supported_languages = ["en", "ar", "de", "fr"]  # أضف "fr"

        # محاولة تحميل المصحح الفرنسي
        self._fr_corrector = None
        self._fr_available = False
        self._try_load_french_corrector()

    def _try_load_french_corrector(self) -> None:
        """محاولة تحميل مصحح الفرنسية (pyspellchecker)."""
        try:
            from spellchecker import SpellChecker
            self._fr_corrector = SpellChecker(language="fr")
            self._fr_available = True
            logger.info("تم تحميل مصحح الفرنسية (pyspellchecker) بنجاح")
        except ImportError:
            logger.warning("مكتبة pyspellchecker غير مثبتة. pip install pyspellchecker")
        except Exception as e:
            logger.warning("فشل تحميل مصحح الفرنسية: %s", e)

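    @staticmethod
    def _is_french_word(word: str) -> bool:
        """Heuristic: does the word look French?"""
        french_chars = sum(1 for c in word if c in "àâäéèêëïîôùûüÿçœæÀÂÄÉÈÊËÏÎÔÙÛÜŸÇŒÆ")
        if french_chars > 0:
            return True
        # Caveat: this fallback accepts any mostly-ASCII Latin word, so unaccented
        # English and German words also pass. Because correct_word() tests French
        # before German and English, such words reach the French corrector first;
        # keep that in mind for mixed-language text.
        latin_chars = sum(1 for c in word if c.isalpha() and c.isascii())
        arabic_chars = sum(1 for c in word if "\u0600" <= c <= "\u06FF")
        return latin_chars > len(word) * 0.5 and arabic_chars == 0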

    def _correct_french_word(self, word: str) -> Optional[str]:
        """تصحيح كلمة فرنسية."""
        if word in self._protected_terms or word.lower() in self._protected_terms:
            return None
        learned = self._get_learned_correction(word)
        if learned:
            return learned
        if self._fr_available and self._fr_corrector:
            try:
                if len(word) <= 2:
                    return None
                if word.lower() in self._fr_corrector.word_frequency:
                    return None
                candidates = self._fr_corrector.correction(word)
                if candidates and candidates.lower() != word.lower():
                    if abs(len(candidates) - len(word)) <= 3:
                        return candidates
            except Exception as e:
                logger.debug("خطأ في تصحيح فرنسي '%s': %s", word, e)
        return None

    def correct_word(self, word: str) -> str:
        """تصحيح كلمة واحدة (محدّث لدعم الفرنسية)."""
        if self._should_skip_word(word):
            return word
        if self._is_arabic_word(word):
            correction = self._correct_arabic_word(word)
        elif self._is_french_word(word):      # <-- أضف هذا الشرط قبل الإنجليزية
            correction = self._correct_french_word(word)
        elif self._is_german_word(word):
            correction = self._correct_german_word(word)
        elif self._is_english_word(word):
            correction = self._correct_english_word(word)
        else:
            return word
        return correction if correction else word

    def is_available(self) -> dict:
        """فحص توفر المصححات (محدّث)."""
        return {
            "english": self._en_available,
            "arabic": self._ar_available,
            "german": self._de_available,
            "french": self._fr_available,  # <-- أضف هذا
            "learned": len(self._learned_corrections) > 0,
        }

الخطوة 3: إضافة نموذج ترجمة في `translator.py` / Step 3: Add Translation Model

# modules/nlp/translator.py

class TechnicalTranslator:
    TRANSLATION_MODELS: dict[str, str] = {
        # ... existing models ...
        # أضف الاتجاهات الجديدة:
        "en-fr": "Helsinki-NLP/opus-mt-en-fr",
        "fr-en": "Helsinki-NLP/opus-mt-fr-en",
        "fr-ar": "Helsinki-NLP/opus-mt-fr-ar",
        "ar-fr": "Helsinki-NLP/opus-mt-ar-fr",
        "fr-de": "Helsinki-NLP/opus-mt-fr-de",
        "de-fr": "Helsinki-NLP/opus-mt-de-fr",
    }

    SUPPORTED_LANGUAGES = ["en", "ar", "de", "fr"]  # أضف "fr"
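When a language is added, it is easy to forget one translation direction. The helper below is a small sketch, not part of `TechnicalTranslator`: `missing_directions` is a hypothetical function that assumes the `"source-target"` key convention shown above and lists every language pair that has no model entry.

```python
from itertools import permutations

# Subset of the direction keys shown above, used only for this demo.
DEMO_MODELS = {
    "en-fr": "Helsinki-NLP/opus-mt-en-fr",
    "fr-en": "Helsinki-NLP/opus-mt-fr-en",
    "fr-de": "Helsinki-NLP/opus-mt-fr-de",
    "de-fr": "Helsinki-NLP/opus-mt-de-fr",
}


def missing_directions(languages: list[str], models: dict[str, str]) -> list[str]:
    """Return every 'source-target' key that has no model entry."""
    return [
        f"{source}-{target}"
        for source, target in permutations(languages, 2)
        if f"{source}-{target}" not in models
    ]


gaps = missing_directions(["en", "fr", "de"], DEMO_MODELS)
```

Running this check against the real `TRANSLATION_MODELS` and `SUPPORTED_LANGUAGES` in a unit test would flag an incomplete language addition before it reaches users.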

الخطوة 4: إضافة نموذج تلخيص (اختياري) / Step 4: Add Summarization Model (Optional)

# modules/nlp/summarizer.py

class TextSummarizer:
    MODELS_BY_LANG = {
        # ... existing models ...
        "fr": [
            "mrm8488/camembert2camembert_shared-french-summarization",
        ],
    }

    FALLBACK_MODELS = {
        # ... existing models ...
        "fr": "mrm8488/camembert2camembert_shared-french-summarization",
    }

الخطوة 5: تحديث كشف اللغة / Step 5: Update Language Detection

# modules/nlp/summarizer.py - _detect_language()
@staticmethod
def _detect_language(text: str) -> str:
    """Detect the text language from its character mix."""
    arabic_chars = sum(1 for c in text if "\u0600" <= c <= "\u06FF")
    german_chars = sum(1 for c in text if c in "äöüÄÖÜß")
    # ä and ü are left out of the French set: they are shared with German and
    # would make umlaut-heavy German text match the French check first.
    french_chars = sum(1 for c in text if c in "àâéèêëïîôùûÿçœæÀÂÉÈÊËÏÎÔÙÛŸÇŒÆ")

    total_alpha = sum(1 for c in text if c.isalpha())
    if total_alpha == 0:
        return "en"

    if arabic_chars / total_alpha > 0.3:
        return "ar"
    if french_chars / total_alpha > 0.05:   # <-- add this before German
        return "fr"
    if german_chars / total_alpha > 0.05:
        return "de"
    return "en"

الخطوة 6: اختبارات اللغة الجديدة / Step 6: Tests for New Language

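# tests/test_spell_corrector.py
import pytest

from modules.nlp.spell_corrector import SpellCorrector


class TestFrenchCorrection:
    """French correction tests."""

    def test_correct_french_word(self):
        """A misspelled French word is corrected."""
        corrector = SpellCorrector()
        # Skip explicitly instead of passing silently when the corrector is missing
        if not corrector.is_available().get("french"):
            pytest.skip("French corrector (pyspellchecker) is not installed")
        result = corrector.correct_word("bonjor")  # misspelling
        assert result == "bonjour"

    def test_protect_french_code(self):
        """Protected code terms are left unchanged."""
        corrector = SpellCorrector()
        result = corrector.correct_word("numpy")  # must not be corrected
        assert result == "numpy"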

5. إضافة محرك OCR جديد / Adding a New OCR Engine

واجهة المحرك / Engine Interface

Every new OCR engine must provide the following interface:

from typing import Optional, Union
import numpy as np
from PIL import Image


class NewOCREngine:
    """
    واجهة محرك OCR جديد.
    يجب أن يوفر: recognize, recognize_batch, recognize_pdf, get_available_engines, unload_models
    """

    def __init__(self, **kwargs):
        """تهيئة المحرك."""
        self._model = None
        self._loaded = False

    def recognize(
        self,
        image: Union[np.ndarray, Image.Image],
        languages: Optional[list[str]] = None,
    ) -> dict:
        """
        التعرف على النص في صورة واحدة.

        المعاملات:
            image: صورة PIL أو numpy array
            languages: لغات مطلوبة

        العائد:
            dict: {
                "text": str,           # النص المستخرج
                "confidence": float,   # مستوى الثقة (0-1)
                "source": str,         # اسم المحرك
                "word_count": int,     # عدد الكلمات
                "words": list[dict],   # تفاصيل الكلمات مع الإحداثيات
                "processing_time": float,
                "details": dict,
            }
        """
        raise NotImplementedError

    def recognize_batch(
        self,
        images: list[Union[np.ndarray, Image.Image]],
        languages: Optional[list[str]] = None,
    ) -> list[dict]:
        """التعرف على مجموعة صور."""
        return [self.recognize(img, languages) for img in images]

    def recognize_pdf(
        self,
        pdf_path: str,
        pages: Optional[list[int]] = None,
        languages: Optional[list[str]] = None,
    ) -> list[dict]:
        """استخراج النص من PDF."""
        raise NotImplementedError

    def get_available_engines(self) -> list[dict]:
        """قائمة المحركات المتاحة."""
        return [{"name": "NewOCR", "available": True, "enabled": True}]

    def unload_models(self) -> None:
        """تفريغ النماذج من الذاكرة."""
        self._model = None
        self._loaded = False

التسجيل في `OCREngine` / Register in OCREngine

# modules/vision/ocr_engine.py

class OCREngine:
    def __init__(self, ...):
        # ... existing engines ...
        self._new_ocr_reader = None
        self._new_ocr_loaded = False

    def _recognize_new_ocr(self, image):
        """التعرف باستخدام المحرك الجديد."""
        # ... implementation ...
        pass

    def recognize(self, image, languages=None):
        # ... existing code ...
        # أضف المحرك الجديد في سلسلة المحركات:
        if self.enable_new_ocr:
            new_result = self._recognize_new_ocr(pil_image)
            if new_result:
                results.append(new_result)
        # ... rest of method ...
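The snippet above appends each engine's output to `results` but does not show how a final answer is chosen. One plausible rule, offered here as an assumption rather than as what `OCREngine` does, is to keep the non-empty result with the highest confidence. `pick_best_result` and `min_confidence` are hypothetical names; the dict keys follow the `recognize()` return format described earlier.

```python
from typing import Optional


def pick_best_result(results: list[dict], min_confidence: float = 0.0) -> Optional[dict]:
    """Return the non-empty result with the highest confidence, or None."""
    usable = [
        r
        for r in results
        if r.get("text", "").strip() and r.get("confidence", 0.0) >= min_confidence
    ]
    if not usable:
        return None
    return max(usable, key=lambda r: r.get("confidence", 0.0))


# Demo results shaped like the recognize() return value (subset of keys).
engine_results = [
    {"text": "Helo", "confidence": 0.72, "source": "TrOCR"},
    {"text": "Hello", "confidence": 0.91, "source": "NewOCR"},
    {"text": "", "confidence": 0.99, "source": "Tesseract"},  # empty text is ignored
]
best = pick_best_result(engine_results)
```

Confidence scores from different engines are not necessarily calibrated against each other, so a per-engine weight or threshold may be needed before comparing them directly.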

6. الاختبارات / Testing

تشغيل الاختبارات / Running Tests

# تشغيل جميع الاختبارات / Run all tests
pytest tests/ -v

# تشغيل اختبار محدد / Run specific test
pytest tests/test_spell_corrector.py -v

# تشغيل مع تغطية الكود / Run with coverage
pytest tests/ --cov=modules --cov-report=html

# تشغيل دالة اختبار محددة / Run a specific test method
pytest tests/test_spell_corrector.py::TestSpellCorrector::test_correct_word -v

# عرض الإخراج المطبوع / Show print output
pytest tests/ -v -s

كتابة اختبار جديد / Writing a New Test

# tests/test_qa_engine.py
"""
QA Engine Tests

Pattern used: Arrange - Act - Assert
"""
# Imported at module level so every test below can reference QAEngine
from modules.nlp.qa_engine import QAEngine


class TestQAEngine:
    """Tests for the QAEngine class."""

    def setup_method(self):
        """Set up before each test (Arrange)."""
        self.qa = QAEngine()

    # === اختبارات الحالة السعيدة / Happy Path Tests ===

    def test_answer_returns_dict(self):
        """اختبار: answer() يعيد dict / Test: answer() returns a dict."""
        result = self.qa.answer("What?", "Context text")
        assert isinstance(result, dict)

    def test_answer_has_required_keys(self):
        """اختبار: النتيجة تحتوي المفاتيح المطلوبة / Test: the result contains the required keys."""
        result = self.qa.answer("Q", "C")
        for key in ["answer", "score", "start", "end", "model"]:
            assert key in result, f"Missing key: {key}"

    def test_answer_score_is_float(self):
        """اختبار: score هو رقم بين 0 و 1 / Test: score is a number between 0 and 1."""
        result = self.qa.answer("Q", "C")
        assert isinstance(result["score"], (int, float))
        assert 0 <= result["score"] <= 1

    # === اختبارات الحالات الخطأ / Error Case Tests ===

    def test_answer_empty_question(self):
        """اختبار: سؤال فارغ يعيد خطأ / Test: an empty question returns an error."""
        result = self.qa.answer("", "Context")
        assert result["answer"] == ""
        assert "error" in result

    def test_answer_empty_context(self):
        """اختبار: سياق فارغ يعيد خطأ / Test: an empty context returns an error."""
        result = self.qa.answer("Question", "")
        assert result["answer"] == ""
        assert "error" in result

    def test_answer_none_inputs(self):
        """اختبار: مدخلات None لا تسبب crash / Test: None inputs do not cause a crash."""
        result = self.qa.answer(None, None)
        assert isinstance(result, dict)

    # === اختبارات التهيئة / Initialization Tests ===

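    def test_lazy_loading(self):
        """اختبار: النموذج لا يُحمّل تلقائياً / Test: the model is not loaded eagerly."""
        # QAEngine is only imported locally in setup_method, so import it here too.
        from modules.nlp.qa_engine import QAEngine
        qa = QAEngine()
        assert qa._pipeline is None

    def test_custom_model_name(self):
        """اختبار: تمرير اسم نموذج مخصص / Test: passing a custom model name."""
        from modules.nlp.qa_engine import QAEngine
        qa = QAEngine(model_name="custom-model")
        assert qa.model_name == "custom-model"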
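
The tests above imply a contract for `QAEngine`: `answer()` always returns a dict with fixed keys, reports bad input through an `error` key rather than raising, and loads nothing at construction time. The real class in `modules/nlp/qa_engine.py` is not shown here, so the following is only a hypothetical stand-in (`FakeQAEngine`) that satisfies the same contract. Something like it can be handy for running UI or pipeline tests without downloading a model.

```python
from typing import Any, Dict, Optional


class FakeQAEngine:
    """Hypothetical stand-in mirroring the contract checked by the tests above."""

    def __init__(self, model_name: str = "fake-model") -> None:
        self.model_name = model_name
        self._pipeline = None  # nothing is loaded at construction time

    def _empty(self, error: str) -> Dict[str, Any]:
        """Build the 'no answer' result, including an error description."""
        return {
            "answer": "",
            "score": 0.0,
            "start": 0,
            "end": 0,
            "model": self.model_name,
            "error": error,
        }

    def answer(self, question: Optional[str], context: Optional[str]) -> Dict[str, Any]:
        """Return a result dict; never raise on empty or None input."""
        if not question or not question.strip():
            return self._empty("question is required")
        if not context or not context.strip():
            return self._empty("context is required")
        # Toy "model": answer with the first word of the context.
        first = context.split()[0]
        start = context.index(first)
        return {
            "answer": first,
            "score": 0.5,
            "start": start,
            "end": start + len(first),
            "model": self.model_name,
        }
```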

Fixtures المشتركة / Shared Fixtures

# tests/conftest.py
"""
Fixtures المشتركة لجميع الاختبارات / Shared fixtures for all tests.
"""
import pytest
import numpy as np
from PIL import Image


@pytest.fixture
def sample_text_en():
    """نص إنجليزي بسيط / Simple English text."""
    return "This is a sample text for testing."


@pytest.fixture
def sample_text_ar():
    """نص عربي بسيط / Simple Arabic text."""
    return "هذا نص عربي للاختبار."


@pytest.fixture
def sample_image():
    """صورة اختبار بسيطة / Simple test image."""
    return Image.new("RGB", (200, 100), color="white")


@pytest.fixture
def sample_pdf_path(tmp_path):
    """مسار PDF اختبار مؤقت / Temporary test PDF path."""
    # Only the path is returned; the file itself is not created here.
    pdf_path = tmp_path / "test.pdf"
    return str(pdf_path)


@pytest.fixture
def sample_config():
    """إعدادات اختبار / Test configuration."""
    from config import OmniFileConfig
    return OmniFileConfig(
        use_gpu=False,
        enable_trocr=False,
        enable_easyocr=False,
        enable_tesseract=False,
    )


@pytest.fixture
def sample_db(tmp_path):
    """قاعدة بيانات اختبار مؤقتة / Temporary test database."""
    from database import OmniFileDB
    db_path = str(tmp_path / "test.db")
    db = OmniFileDB(db_path)
    db.create_tables()
    yield db
    # cleanup: close the connection here if OmniFileDB exposes one;
    # the tmp_path directory itself is managed by pytest

7. النشر / Deployment

7.1 HuggingFace Spaces (Docker)

المشروع يحتوي Dockerfile جاهز. للنشر:
The project ships with a ready-to-use Dockerfile. To deploy:

# Dockerfile (موجود بالفعل / already exists)
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
# (curl is not part of python:3.11-slim and is needed by the HEALTHCHECK below)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    poppler-utils \
    tesseract-ocr \
    tesseract-ocr-ara \
    tesseract-ocr-eng \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy all project files
COPY . .

# Expose the port HuggingFace Spaces expects (Streamlit's own default is 8501)
EXPOSE 7860

# Health check
HEALTHCHECK CMD curl -f http://localhost:7860/_stcore/health || exit 1

# Run Streamlit
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
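
The health check above shells out to `curl`. As an alternative sketch, the same probe can be written with the Python standard library so that it does not depend on any extra system package. The function name `is_healthy` and the timeout value are assumptions; the `/_stcore/health` path and port 7860 are taken from the Dockerfile above.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:7860/_stcore/health",
               timeout: float = 3.0) -> bool:
    """Return True when the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures and 4xx/5xx replies.
        return False
```

A small script that ends with `sys.exit(0 if is_healthy() else 1)` could then serve as the `HEALTHCHECK` command, since Docker treats exit code 0 as healthy and 1 as unhealthy.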

7.2 Docker Compose (مع Redis) / Docker Compose (with Redis)

# docker-compose.yml
version: '3.8'

services:
  omnifile:
    build: .
    ports:
      - "7860:7860"
    environment:
      - ENABLE_CELERY=true
      - CELERY_BROKER_URL=redis://redis:6379/0
    volumes:
      - ./data:/app/data
      - ./models_cache:/app/models_cache
      - ./database:/app/database
    depends_on:
      - redis
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  celery-worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    environment:
      - ENABLE_CELERY=true
      - CELERY_BROKER_URL=redis://redis:6379/0
    volumes:
      - ./data:/app/data
      - ./models_cache:/app/models_cache
    depends_on:
      - redis

volumes:
  redis_data:

# تشغيل / Run
docker-compose up -d

# مشاهدة السجلات / View logs
docker-compose logs -f omnifile

# إيقاف / Stop
docker-compose down
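
The compose file passes `ENABLE_CELERY` and `CELERY_BROKER_URL` to both the app and the worker. How the project actually reads them is not shown in this guide; the sketch below is one way it could be done. `CelerySettings`, `load_celery_settings`, the accepted truthy strings, and the fallback broker URL are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "enabled"; this set is an assumption, not the project's rule.
_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CelerySettings:
    enabled: bool
    broker_url: str


def load_celery_settings(env: Optional[Mapping[str, str]] = None) -> CelerySettings:
    """Read Celery settings from the environment (or from a supplied mapping)."""
    source = os.environ if env is None else env
    enabled = source.get("ENABLE_CELERY", "false").strip().lower() in _TRUTHY
    # Hypothetical default for local runs without Docker.
    broker_url = source.get("CELERY_BROKER_URL", "redis://localhost:6379/0")
    return CelerySettings(enabled=enabled, broker_url=broker_url)
```

Accepting a mapping instead of reading `os.environ` directly keeps the function easy to test without touching the real environment.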

7.3 Google Colab

# !python main.py --colab
# أو / or:
from config import OmniFileConfig
cfg = OmniFileConfig.from_colab_drive()
cfg.setup_environment()

# ثم / then:
!streamlit run app.py --server.port 7860

7.4 AWS / Azure / GCP

# باستخدام Docker / Using Docker
# AWS ECS:
docker build -t omnifile .
aws ecs register-task-definition --family omnifile --container-definitions ...

# Azure Container Apps:
az containerapp create --image omnifile ...

# GCP Cloud Run:
gcloud run deploy omnifile --source . --port 7860

7.5 Celery Workers (المعالجة غير المتزامنة) / Celery Workers

# 1. تشغيل Redis / Start Redis
redis-server &

# 2. تشغيل Worker / Start a worker
celery -A tasks worker --loglevel=info --concurrency=4

# 3. تشغيل Worker مع مراقبة / Start a worker with task and time limits
celery -A tasks worker --loglevel=info --concurrency=2 \
  --max-tasks-per-child=1000 \
  --time-limit=300 \
  --soft-time-limit=240

# 4. مراقبة المهام / Monitor tasks
celery -A tasks inspect active
celery -A tasks inspect reserved
celery -A tasks events --dump

8. معايير الكود / Code Standards

8.1 التنسيق / Formatting

# 1. Python 3.8+ type hints
def process_text(text: str, language: str = "en") -> dict:
    ...

# 2. توثيق عربي + إنجليزي / Arabic + English documentation
class SpellCorrector:
    """
    مصحح إملائي ذكي: يدعم العربية والإنجليزية مع حماية المصطلحات البرمجية.
    Smart spell corrector: supports Arabic and English and protects code terms.

    الميزات / Features:
        - تصحيح متعدد اللغات / Multilingual correction
        - حماية المصطلحات التقنية / Technical-term protection
        - تعلم من المستخدم / Learning from user feedback
    """

    def correct_word(self, word: str) -> str:
        """
        تصحيح كلمة واحدة.
        Correct a single word.

        المعاملات / Args:
            word: الكلمة المراد تصحيحها / The word to correct

        العائد / Returns:
            الكلمة المصححة / The corrected word
        """
        ...

8.2 التسجيل / Logging

# استخدم logging بدلاً من print / Use logging instead of print

import logging
logger = logging.getLogger(__name__)

# الصحيح / Correct:
logger.info("تم تحميل النموذج / Model loaded: %s", model_name)
logger.warning("فشل التحميل / Loading failed: %s", error)
logger.error("خطأ حرج / Critical error: %s", error, exc_info=True)
logger.debug("قيمة متغيرة / Variable value: %s", value)

# الخاطئ / Wrong:
print("تم التحميل")           # لا تستخدم print في الإنتاج / Do not use print in production
print(f"Error: {error}")      # لا تستخدم print / Do not use print
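
One practical detail: `logging.getLogger(__name__)` on its own does not make `info` or `debug` messages visible, because an unconfigured root logger only lets `WARNING` and above through. A minimal sketch of an entry-point setup follows. The function name `configure_logging` and the format string are assumptions, not taken from the project.

```python
import logging
from typing import Optional, TextIO


def configure_logging(level: int = logging.INFO,
                      stream: Optional[TextIO] = None) -> logging.Handler:
    """Attach a stream handler to the root logger and return it.

    When stream is None, logging.StreamHandler falls back to sys.stderr.
    Call this once from the entry point, not from library modules.
    """
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)
    return handler
```

Module-level loggers created with `logging.getLogger(__name__)` inherit this level and handler, so individual modules need no configuration of their own.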

8.3 معالجة الأخطاء / Error Handling

# 1. Graceful Degradation - انحطاط سلس
try:
    from presidio_analyzer import AnalyzerEngine
    self._analyzer = AnalyzerEngine()
    self._presidio_available = True
except ImportError:
    logger.warning("presidio غير مثبت. سيتم استخدام Regex فقط / presidio is not installed; using regex only.")
    self._presidio_available = False
except Exception as e:
    logger.warning("فشل تحميل presidio / Failed to load presidio: %s", e)
    self._presidio_available = False

# 2. لا تستخدم except: فارغ / Never use bare except:
try:
    ...
except:  # خاطئ / Wrong
    pass

# 3. حدد نوع الاستثناء / Specify exception type
try:
    ...
except (ValueError, TypeError) as e:  # صحيح / Correct
    logger.error("خطأ / Error: %s", e)
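
The graceful-degradation pattern from item 1 can be factored into a small helper so that each optional dependency is handled the same way. This is a sketch under stated assumptions: `optional_import`, `EMAIL_RE`, and `find_emails` are hypothetical names, and the regex is a deliberately simple illustration of a "regex only" fallback rather than a complete email matcher.

```python
import importlib
import logging
import re
from types import ModuleType
from typing import List, Optional

logger = logging.getLogger(__name__)


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; otherwise log a warning and return None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        logger.warning("%s غير مثبت / %s is not installed; using fallback.", name, name)
        return None


# Simplified pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(text: str) -> List[str]:
    """Regex-only fallback for when no richer analyzer is available."""
    return EMAIL_RE.findall(text)
```

A caller would check the return value once, for example `analyzer = optional_import("presidio_analyzer")`, and route to `find_emails` when it is `None`.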

8.4 تحميل بطيء / Lazy Loading

# النماذج الثقيلة لا تُحمّل في __init__ / Heavy models don't load in __init__

class MyEngine:
    def __init__(self):
        self._model = None        # لا تحميل هنا / Don't load here
        self._loaded = False

    def _load_model(self) -> bool:
        """تحميل النموذج عند أول استخدام / Load model on first use."""
        if self._loaded:
            return True
        try:
            self._model = load_heavy_model()
            self._loaded = True
            return True
        except Exception as e:
            logger.error("فشل التحميل / Loading failed: %s", e)
            return False

    def process(self, data):
        if not self._load_model():  # تحميل عند الحاجة / Load when needed
            return self._fallback(data)
        return self._model.process(data)
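
`MyEngine` above is fine for single-threaded use. If an engine instance is shared between threads, two callers could both see `_loaded` as false and load the model twice. The sketch below adds a lock with a second check inside it. The thread-safety concern and the `LazyResource` class are additions for illustration, not something the guide prescribes.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Load an expensive object on first access and reuse it afterwards."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        """Return the loaded object, loading it exactly once."""
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock: another thread may have finished loading.
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]

    def unload(self) -> None:
        """Drop the object so the next get() loads it again."""
        with self._lock:
            self._value = None
            self._loaded = False
```

Usage might look like `model = LazyResource(load_heavy_model)` followed by `model.get()` wherever the model is needed.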

8.5 إدارة الذاكرة / Memory Management

# 1. تنظيف الذاكرة بعد المعالجة / Free memory after processing
import gc
import torch

def process_batch(images):
    results = []
    for img in images:
        result = process(img)
        results.append(result)
    
    # تنظيف / Cleanup
    torch.cuda.empty_cache()
    gc.collect()
    return results

# 2. تفريغ النماذج عند الحاجة / Unload models when needed
def unload_models(self):
    """تفريغ النماذج من الذاكرة / Unload models from memory."""
    self._model = None
    self._loaded = False
    torch.cuda.empty_cache()
    gc.collect()
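
The snippets above import `torch` unconditionally, which fails on CPU-only installs that do not ship PyTorch. A hedged variant that degrades gracefully is sketched below. `free_memory` is a hypothetical helper name, and collecting garbage before emptying the CUDA cache is a design choice made here so that tensors freed by the collector can also be released from the cache.

```python
import gc
import importlib.util


def free_memory() -> int:
    """Run the garbage collector, then release cached GPU memory if possible.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    # Only touch torch when it is installed and a CUDA device is present.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return collected
```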

8.6 دعم اللغات / Language Support

# كل نص مرئي للمستخدم يجب أن يكون ثنائي اللغة
# All user-visible text must be bilingual

# أسماء المتغيرات والدوال: إنجليزية / Variable and function names: English
def correct_text(text: str) -> dict:
    ...

# الرسائل والتوثيق: عربي + إنجليزي / Messages and docs: Arabic + English
logger.info("تم تحميل المصحح بنجاح / Spell corrector loaded successfully")
st.success("تمت المعالجة بنجاح / Processing completed successfully")

# أخطاء: عربي + إنجليزي / Errors: Arabic + English
raise ValueError("حقل file_name مطلوب / file_name field is required")
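
To keep the "Arabic / English" format consistent, the two halves of each message can be stored together and joined in one place. The following is only a sketch: the `MESSAGES` catalog, its keys, and the `bilingual` and `message` helpers are hypothetical, while the message texts themselves are the ones used in the examples above.

```python
from typing import Dict, Tuple

# Hypothetical message catalog: key -> (Arabic, English).
MESSAGES: Dict[str, Tuple[str, str]] = {
    "file_name_required": ("حقل file_name مطلوب", "file_name field is required"),
    "processing_done": ("تمت المعالجة بنجاح", "Processing completed successfully"),
}


def bilingual(ar: str, en: str) -> str:
    """Join an Arabic and an English string in the guide's 'ar / en' format."""
    return f"{ar} / {en}"


def message(key: str) -> str:
    """Look up a catalog entry and render it bilingually."""
    ar, en = MESSAGES[key]
    return bilingual(ar, en)
```

With this in place, an error could be raised as `raise ValueError(message("file_name_required"))`.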

9. GitHub Actions CI/CD

9.1 اختبارات تلقائية / Automated Tests

# .github/workflows/tests.yml
name: Tests

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.10', '3.11']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y tesseract-ocr tesseract-ocr-ara

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        run: |
          pytest tests/ -v --cov=modules --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install ruff
      - run: ruff check modules/ src/ tests/

9.2 بناء Docker / Docker Build

# .github/workflows/docker.yml
name: Docker Build

on:
  push:
    tags: ['v*']
  workflow_dispatch:

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ secrets.DOCKER_USERNAME }}/omnifile-processor:latest
            ${{ secrets.DOCKER_USERNAME }}/omnifile-processor:${{ github.ref_name }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

9.3 نشر على HuggingFace Spaces / Deploy to HuggingFace Spaces

# .github/workflows/deploy.yml
name: Deploy to HuggingFace

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Push to HuggingFace
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
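        run: |
          # Space name as shown on this Space's page (OmniFile-Processor); adjust if yours differs
          git clone https://DrAbdulmalek:$HF_TOKEN@huggingface.co/spaces/DrAbdulmalek/OmniFile-Processor hf_space
          # Exclude the clone itself so it is not copied into hf_space/hf_space
          rsync -av --delete --exclude='.git' --exclude='hf_space' ./ hf_space/
          cd hf_space
          # The runner has no git identity by default; this bot identity is an assumption
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          # Commit only when something changed, so an unchanged deploy does not fail the job
          git diff --cached --quiet || git commit -m "Deploy from GitHub Actions"
          git push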

المؤلف / Author: Dr Abdulmalek Tamer Al-husseini
📍 الموقع / Location: Homs, Syria
📧 البريد / Email: Abdulmalek.husseini@gmail.com
الإصدار / Version: 2.1
الترخيص / License: راجع ملف LICENSE / see the LICENSE file
المساهمة / Contributing: Pull Requests مرحب بها على branch develop / Pull requests are welcome on the develop branch