Matn - Arabic OCR for Classical Islamic Texts

Project Overview

We're building an end-to-end machine learning system that can:

  • Extract text from classical Arabic Islamic manuscript images
  • Provide structured output with proper formatting
  • Handle diacritics and classical Arabic conventions
  • Deploy as a production-ready service

Dataset: mssqpi/Arabic-OCR-Dataset (2.16M image-text pairs)
Model: DeepSeek-OCR (fine-tuned with LoRA via Unsloth)
Architecture: Vision Transformer Encoder → Language Model Decoder
Trained Model: https://huggingface.co/emadahmed97/matn-ocr-arabic-finetuned

Implementation Phases

Phase 1: Introduction & Setup

1.1 Environment Setup

  • ✅ Install required dependencies (datasets, transformers, unsloth)
  • ✅ Explore Arabic OCR dataset structure
  • ✅ Set up DeepSeek-OCR model integration
  • ✅ Configure Arabic text processing pipeline

1.2 Data Exploration & Analysis (EDA)

  • ✅ Dataset statistics and sample analysis
  • ✅ Arabic text characteristics analysis
  • ✅ Classical Islamic text patterns identification
  • ✅ Diacritics and formatting analysis

1.3 MLflow Integration for Arabic OCR

  • ✅ Configure MLflow for OCR experiments
  • ✅ Set up Arabic text evaluation metrics
  • ✅ Create OCR-specific logging and tracking

Phase 2: Training Pipeline Development

2.1 Data Loading & Preprocessing

  • ✅ Load mssqpi/Arabic-OCR-Dataset via HuggingFace datasets
  • ✅ Implement Arabic text normalization
  • ✅ Convert dataset to conversation format for fine-tuning
  • ✅ DeepSeekOCRDataCollator for image-text pair processing
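The conversation-format conversion above can be sketched in plain Python. The field names ("image", "text") and the instruction prompt below are illustrative assumptions for the sketch, not the dataset's exact schema:

```python
# Sketch: wrap one raw image-text pair as a user/assistant exchange for
# vision-language fine-tuning. Field names and prompt are assumptions.

OCR_PROMPT = "Extract the Arabic text from this manuscript image."

def to_conversation(sample: dict) -> dict:
    """Convert an image-text pair into a two-turn conversation record."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["image"]},
                    {"type": "text", "text": OCR_PROMPT},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["text"]}],
            },
        ]
    }

example = to_conversation({"image": "page_001.png", "text": "بسم الله الرحمن الرحيم"})
```

In this shape, the image and prompt form the user turn and the ground-truth transcription forms the assistant turn the model is trained to produce.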

2.2 Model Architecture Setup

  • ✅ DeepSeek-OCR as base vision-language model
  • ✅ Configure LoRA fine-tuning for efficient training (2% of parameters)
  • ✅ Set up Unsloth for 2x faster training
  • ✅ Implement conversation-based training format
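As a rough sketch of the LoRA setup, the hyperparameters below are illustrative placeholders (the project's actual rank, alpha, and target modules are not stated here), together with a back-of-envelope count of trainable adapter parameters:

```python
# Illustrative LoRA hyperparameters; values are assumptions for the sketch.
lora_config = {
    "r": 16,               # adapter rank: dimension of the low-rank update
    "lora_alpha": 32,      # scaling factor applied to the adapter output
    "lora_dropout": 0.05,  # dropout on adapter activations
    "target_modules": [    # attention projections typically adapted
        "q_proj", "k_proj", "v_proj", "o_proj",
    ],
    "bias": "none",
}

def adapter_params(hidden: int, n_layers: int, cfg: dict) -> int:
    """Rough count of trainable LoRA parameters: two rank-r matrices
    (hidden x r and r x hidden) per targeted projection per layer."""
    per_module = 2 * hidden * cfg["r"]
    return per_module * len(cfg["target_modules"]) * n_layers
```

For a hypothetical 4096-wide, 32-layer decoder this yields ~17M trainable parameters, which is how LoRA keeps the trainable fraction in the low single-digit percent range.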

2.3 Cross-Validation Strategy

  • 🔲 Adapt cross-validation for OCR tasks
  • 🔲 Implement text-based evaluation splits
  • 🔲 Handle Arabic text-specific validation
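One way the planned text-based splits could work is a deterministic hash of the reference text, so identical passages never straddle train and validation. A sketch under that assumption, not the project's implementation:

```python
import hashlib

def fold_of(text: str, n_folds: int = 5) -> int:
    """Deterministic fold assignment keyed on the reference text, so the
    same passage always lands in the same split across runs."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds

def split(samples, val_fold: int = 0, n_folds: int = 5):
    """Partition samples into train/validation by text-hash fold."""
    train = [s for s in samples if fold_of(s["text"], n_folds) != val_fold]
    val = [s for s in samples if fold_of(s["text"], n_folds) == val_fold]
    return train, val
```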

2.4 Training Implementation

  • ✅ Fine-tune DeepSeek-OCR with LoRA adapters
  • ✅ Implement production training pipeline with MLflow tracking
  • ✅ Configure training hyperparameters for efficient fine-tuning
  • ✅ Add conversation format data processing

2.5 Evaluation Metrics

  • ✅ Character Error Rate (CER)
  • ✅ Word Error Rate (WER)
  • ✅ BLEU score for text quality
  • ✅ Diacritic accuracy assessment
  • ✅ Islamic terminology recognition accuracy
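The CER and WER metrics above reduce to plain edit distance; a minimal, dependency-free sketch:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

For example, reading كتاب as كتب is one deletion against a four-character reference, so CER = 0.25.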

2.6 Model Registration

  • ✅ Register best performing models (integrated in training pipeline)
  • ✅ Version control for Arabic OCR models (via MLflow tracking)
  • ✅ Model metadata and documentation (automated via pipeline)

Phase 3: MLOps Automation Pipeline

3.1 GitHub Actions Automation

  • ✅ Create workflow for automated training triggers
  • ✅ Set up data validation and testing pipeline
  • ✅ Implement automated model performance gating
  • ✅ Add single-environment deployment (direct to prod)

3.2 HuggingFace Spaces Training Environment

  • ✅ Set up GPU-enabled training space (L4 GPU)
  • ✅ Create Gradio interface for manual training
  • ✅ Implement REST API for automated training calls
  • ✅ Add real-time training progress monitoring

3.3 Model Registry & Versioning

3.4 Inference Pipeline

  • ✅ Add Inference tab to HF Spaces Gradio UI (upload image → OCR text output)
  • ✅ Add /api/infer REST endpoint for programmatic inference
  • ✅ Load LoRA model from HF Hub (emadahmed97/matn-ocr-arabic-finetuned)
  • ✅ Handle RTL text formatting and confidence scoring
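The RTL formatting step can be sketched with Unicode directional embedding marks so Arabic output renders right-to-left in mixed-direction UIs; the 50% Arabic-letter threshold below is an illustrative assumption:

```python
# Right-to-Left Embedding / Pop Directional Formatting control characters.
RLE, PDF = "\u202b", "\u202c"

def is_arabic_dominant(text: str) -> bool:
    """True when most letters fall in the main Arabic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    arabic = [c for c in letters if "\u0600" <= c <= "\u06ff"]
    return bool(letters) and len(arabic) / len(letters) > 0.5

def format_rtl(text: str) -> str:
    """Wrap Arabic-dominant output in directional marks for display."""
    return f"{RLE}{text}{PDF}" if is_arabic_dominant(text) else text
```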

Phase 4: Evaluation & Monitoring Pipeline

Comprehensive monitoring and evaluation system:

4.1 Automated Evaluation Metrics

  • 🔲 Real-time CER/WER/BLEU calculation during training
  • 🔲 Arabic-specific metrics (diacritic accuracy, Islamic terminology)
  • 🔲 Performance benchmarking against baseline models
  • 🔲 Automated model comparison and ranking

4.2 Production Model Monitoring

  • 🔲 OCR accuracy tracking in production
  • 🔲 Model drift detection (performance degradation)
  • 🔲 Latency and throughput monitoring
  • 🔲 Cost tracking (GPU usage, API calls)
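A minimal sketch of how the planned drift detection could work, comparing a rolling production CER against a baseline; the window size and margin are illustrative values, not tuned thresholds:

```python
from collections import deque

class DriftMonitor:
    """Flag drift when the rolling mean CER exceeds baseline + margin."""

    def __init__(self, baseline_cer: float, window: int = 100, margin: float = 0.02):
        self.baseline = baseline_cer
        self.margin = margin
        self.recent = deque(maxlen=window)  # sliding window of CER samples

    def record(self, cer: float) -> bool:
        """Record one production CER sample; return True if drift is detected."""
        self.recent.append(cer)
        rolling = sum(self.recent) / len(self.recent)
        return rolling > self.baseline + self.margin
```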

4.3 Data Quality Monitoring

  • 🔲 Input image quality assessment
  • 🔲 Arabic text output validation
  • 🔲 Character distribution monitoring
  • 🔲 Detection of adversarial or out-of-domain inputs
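Character distribution monitoring can be sketched as an L1 distance between normalized character frequencies of a reference batch and a current batch; a toy version, not the project's implementation:

```python
from collections import Counter

def char_distribution(texts):
    """Normalized character frequency over a batch of OCR outputs."""
    counts = Counter(c for t in texts for c in t if not c.isspace())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def distribution_shift(ref: dict, cur: dict) -> float:
    """L1 distance between two character distributions; a large value
    suggests the output character mix has drifted from the reference."""
    keys = set(ref) | set(cur)
    return sum(abs(ref.get(k, 0.0) - cur.get(k, 0.0)) for k in keys)
```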

4.4 MLOps Monitoring Dashboard

  • 🔲 Training pipeline health and status
  • 🔲 Model performance trends over time
  • 🔲 A/B testing results visualization
  • 🔲 Automated alerting for performance issues

4.5 Continuous Evaluation & Testing

  • 🔲 Automated testing pipeline with held-out datasets
  • 🔲 Synthetic Arabic manuscript generation for testing
  • 🔲 Human evaluation workflow integration
  • 🔲 Automated retraining triggers based on performance

Potential Future Work

Model Serving (Standalone)

  • MLflow model serving / MLServer integration
  • Scalable inference backend with load balancing and caching
  • Performance optimization

Cloud Deployment

  • CloudFormation / SageMaker endpoint configuration
  • Auto-scaling, monitoring, and cost optimization
  • Remote MLflow tracking server with S3 artifact storage

Technical Specifications

Model Architecture

Input: Manuscript Image (PNG/JPEG)
  ↓
DeepSeek-OCR Vision Encoder
  ↓
Language Model Decoder (with LoRA adapters)
  ↓
Output: Arabic Text

Training Pipeline

mssqpi/Arabic-OCR-Dataset (2.16M samples)
  ↓
Conversation Format (User: <image> + prompt, Assistant: text)
  ↓
DeepSeekOCRDataCollator (image preprocessing + tokenization)
  ↓
LoRA Fine-tuning via Unsloth (2% of parameters)
  ↓
Push LoRA adapters to HuggingFace Hub

Inference Pipeline

Upload Image → Load Base Model + LoRA Adapters → model.infer() → Arabic Text

Evaluation Pipeline

OCR Output → Character/Word Error Rate
           → BLEU Score
           → Diacritic Accuracy
           → Islamic Term Recognition

Arabic-Specific Considerations

  • Right-to-Left text direction
  • Connected letterforms with contextual shapes
  • Diacritics preservation for classical texts
  • Islamic terminology and abbreviations
  • Historical spelling variations
  • Multi-column manuscript layouts
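Since Arabic diacritics (harakat) are Unicode combining marks, the standard library can strip them, which makes it easy to compare OCR output with and without vowelization; a small sketch:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks while keeping base letters.
    Combining marks (fatha, damma, kasra, shadda, sukun, tanween)
    all have a nonzero Unicode combining class."""
    return "".join(c for c in text if not unicodedata.comparing(c)) if False else \
           "".join(c for c in text if not unicodedata.combining(c))
```

Comparing CER on raw strings versus on diacritic-stripped strings separates letter-recognition errors from vowelization errors, which is useful when only some ground-truth texts are fully vowelized.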

Success Metrics

Model Performance

  • Character Error Rate < 5% for printed text
  • Word Error Rate < 10% for classical manuscripts
  • Diacritic Accuracy > 90% for vowelized text
  • Processing Speed < 2 seconds per page
  • Model Size < 1GB for deployment efficiency

MLOps Automation

  • End-to-end automation: Code push → Auto train → Auto deploy < 1 hour
  • Training cost efficiency: < $10 per training run on L4 GPU
  • Deployment reliability: 99.9% uptime with auto-scaling
  • Model versioning: 100% reproducible experiments
  • Monitoring coverage: Real-time alerts for performance degradation

Complete MLOps Workflow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Developer     │    │  GitHub Actions  │    │  HF Spaces GPU  │
│   Push Code     │───►│  Trigger Train   │───►│   LoRA Finetune │
│   Update Data   │    │  Run Tests       │    │   MLflow Track  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Production    │◄───│  Model Registry  │◄───│  Auto Evaluate  │
│   Deployment    │    │  A/B Testing     │    │  Performance    │
│   Auto-scale    │    │  Version Control │    │  Gate Release   │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Current Status

Completed

  • Phase 1: Introduction & Setup (Arabic text processing, MLflow integration)
  • Phase 2: Training Pipeline Development (DeepSeek-OCR + LoRA fine-tuning)
  • Phase 3: MLOps Automation
    • ✅ GitHub Actions workflow for automated training triggers
    • ✅ HF Spaces training + inference environment with Gradio UI + REST API (L4 GPU)
    • ✅ DeepSeekOCRDataCollator ported from notebook
    • ✅ Auto-push trained LoRA to HF Hub
    • ✅ MLflow experiment tracking (local SQLite on Space)
    • ✅ Inference tab with RTL output + /api/infer endpoint
    • ✅ Model loading: base DeepSeek-OCR + LoRA adapters from Hub
    • ✅ Repo consolidation: single repo syncs to HF Spaces via GitHub Actions
    • ✅ sync-to-hf-spaces.yml workflow for auto-deploy on push to main

Up Next

  • Phase 4: Evaluation & Monitoring Pipeline

Matn - Arabic OCR for classical Islamic texts, powered by DeepSeek-OCR with LoRA fine-tuning.