Matn - Arabic OCR for Classical Islamic Texts
Project Overview
We're building an end-to-end machine learning system that can:
- Extract text from classical Arabic Islamic manuscript images
- Provide structured output with proper formatting
- Handle diacritics and classical Arabic conventions
- Deploy as a production-ready service
Dataset: mssqpi/Arabic-OCR-Dataset (2.16M image-text pairs)
Model: DeepSeek-OCR (fine-tuned with LoRA via Unsloth)
Architecture: Vision Transformer Encoder → Language Model Decoder
Trained Model: https://huggingface.co/emadahmed97/matn-ocr-arabic-finetuned
Implementation Phases
Phase 1: Introduction & Setup
1.1 Environment Setup
- ✅ Install required dependencies (datasets, transformers, unsloth)
- ✅ Explore Arabic OCR dataset structure
- ✅ Set up DeepSeek-OCR model integration
- ✅ Configure Arabic text processing pipeline
1.2 Data Exploration & Analysis (EDA)
- ✅ Dataset statistics and sample analysis
- ✅ Arabic text characteristics analysis
- ✅ Classical Islamic text patterns identification
- ✅ Diacritics and formatting analysis
1.3 MLflow Integration for Arabic OCR
- ✅ Configure MLflow for OCR experiments
- ✅ Set up Arabic text evaluation metrics
- ✅ Create OCR-specific logging and tracking (see the sketch below)
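A minimal sketch of the MLflow wiring, assuming the local SQLite backend mentioned in the status notes; the experiment name and metric values are illustrative placeholders, not the project's actual configuration:

```python
import mlflow

# Point MLflow at a local SQLite backend (as used on the Space) and
# group runs under an OCR-specific experiment.
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("matn-arabic-ocr")  # hypothetical experiment name

with mlflow.start_run(run_name="lora-finetune-demo"):
    mlflow.log_params({"base_model": "DeepSeek-OCR", "method": "LoRA"})
    # Placeholder metric values for illustration only.
    mlflow.log_metrics({"cer": 0.048, "wer": 0.095, "diacritic_accuracy": 0.91})
```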
Phase 2: Training Pipeline Development
2.1 Data Loading & Preprocessing
- ✅ Load mssqpi/Arabic-OCR-Dataset via HuggingFace datasets
- ✅ Implement Arabic text normalization
- ✅ Convert dataset to conversation format for fine-tuning (see the sketch below)
- ✅ DeepSeekOCRDataCollator for image-text pair processing
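A sketch of the conversation-format conversion, assuming the dataset exposes image and text columns; the actual field names and prompt wording may differ:

```python
from datasets import load_dataset

# Wrap each image-text pair as a user/assistant exchange, the format
# expected by vision-chat fine-tuning. The prompt below is illustrative.
def to_conversation(sample, prompt="Extract the Arabic text from this image."):
    return {
        "messages": [
            {"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": prompt}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": sample["text"]}]},
        ]
    }

# Stream to avoid downloading all 2.16M samples up front.
ds = load_dataset("mssqpi/Arabic-OCR-Dataset", split="train", streaming=True)
first = to_conversation(next(iter(ds)))
```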
2.2 Model Architecture Setup
- ✅ DeepSeek-OCR as base vision-language model
- ✅ Configure LoRA fine-tuning for efficient training (2% of parameters; see the sketch below)
- ✅ Set up Unsloth for 2x faster training
- ✅ Implement conversation-based training format
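A hedged sketch of the Unsloth/LoRA setup; the model id, rank, and layer choices here are assumptions, not the project's verbatim configuration:

```python
from unsloth import FastVisionModel

# Load the base vision-language model; 4-bit loading keeps it within a
# single L4 GPU's memory budget.
model, tokenizer = FastVisionModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",  # assumed base model id
    load_in_4bit=True,
)

# Attach small LoRA adapters (~2% of parameters) instead of full fine-tuning.
model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.0,
    finetune_vision_layers=True,   # adapt the vision encoder too
    finetune_language_layers=True,
)
```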
2.3 Cross-Validation Strategy
- 🔲 Adapt cross-validation for OCR tasks
- 🔲 Implement text-based evaluation splits
- 🔲 Handle Arabic text-specific validation
2.4 Training Implementation
- ✅ Fine-tune DeepSeek-OCR with LoRA adapters
- ✅ Implement production training pipeline with MLflow tracking
- ✅ Configure training hyperparameters for efficient fine-tuning (see the sketch below)
- ✅ Add conversation format data processing
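A sketch of the trainer wiring, continuing from the model and dataset sketches above. DeepSeekOCRDataCollator is the project's own collator, so its constructor signature is assumed here, and the hyperparameters are illustrative:

```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    output_dir="outputs",
    report_to="mlflow",              # stream metrics into the MLflow run
    remove_unused_columns=False,     # keep image columns for the collator
)

trainer = SFTTrainer(
    model=model,                     # LoRA-wrapped model from the setup sketch
    args=args,
    train_dataset=train_ds,          # conversation-format dataset from 2.1
    data_collator=DeepSeekOCRDataCollator(tokenizer),  # assumed signature
)
trainer.train()
```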
2.5 Evaluation Metrics
- ✅ Character Error Rate (CER)
- ✅ Word Error Rate (WER; CER/WER sketch below)
- ✅ BLEU score for text quality
- ✅ Diacritic accuracy assessment
- ✅ Islamic terminology recognition accuracy
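A small sketch of the core text metrics using the jiwer library; the sample strings and scores are illustrative only:

```python
import jiwer

reference = "بسم الله الرحمن الرحيم"    # ground-truth transcription
hypothesis = "بسم الله الرحمن الرحيم"   # model output

cer = jiwer.cer(reference, hypothesis)     # character error rate
wer = jiwer.wer(reference, hypothesis)     # word error rate
print(f"CER: {cer:.3f}  WER: {wer:.3f}")   # 0.000 for identical strings
```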
2.6 Model Registration
- ✅ Register best performing models (integrated in training pipeline)
- ✅ Version control for Arabic OCR models (via MLflow tracking)
- ✅ Model metadata and documentation (automated via pipeline)
Phase 3: MLOps Automation Pipeline
3.1 GitHub Actions Automation
- ✅ Create workflow for automated training triggers
- ✅ Set up data validation and testing pipeline
- ✅ Implement automated model performance gating
- ✅ Add single-environment deployment (direct to prod)
3.2 HuggingFace Spaces Training Environment
- ✅ Set up GPU-enabled training space (L4 GPU)
- ✅ Create Gradio interface for manual training
- ✅ Implement REST API for automated training calls (see the sketch below)
- ✅ Add real-time training progress monitoring
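A hedged sketch of calling the Space programmatically with gradio_client; the Space id and api_name below are placeholders, since the real endpoint names depend on how the Gradio app registers them:

```python
from gradio_client import Client

client = Client("emadahmed97/matn-ocr")     # hypothetical Space id
result = client.predict(api_name="/train")  # hypothetical training endpoint
print(result)
```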
3.3 Model Registry & Versioning
- ✅ Auto-push trained LoRA adapters to HuggingFace Hub (via HF_TOKEN + HF_MODEL_REPO)
- ✅ Model saved at: https://huggingface.co/emadahmed97/matn-ocr-arabic-finetuned
- 🔲 A/B testing infrastructure setup
- 🔲 Model promotion workflow (dev → staging → prod)
3.4 Inference Pipeline
- ✅ Add Inference tab to HF Spaces Gradio UI (upload image → OCR text output)
- ✅ Add /api/infer REST endpoint for programmatic inference
- ✅ Load LoRA model from HF Hub (emadahmed97/matn-ocr-arabic-finetuned; see the sketch below)
- ✅ Handle RTL text formatting and confidence scoring
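A sketch of the inference path: base DeepSeek-OCR plus the fine-tuned LoRA adapters from the Hub. model.infer() is DeepSeek-OCR's custom entry point loaded via trust_remote_code; its exact signature and the prompt wording here are assumptions:

```python
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_id = "deepseek-ai/DeepSeek-OCR"  # assumed base model id
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModel.from_pretrained(base_id, trust_remote_code=True).eval()

# Layer the fine-tuned adapters published by this project on top.
model = PeftModel.from_pretrained(model, "emadahmed97/matn-ocr-arabic-finetuned")

text = model.infer(
    tokenizer,
    prompt="<image>\nExtract the Arabic text from this image.",
    image_file="manuscript_page.png",  # placeholder path
)
```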
Phase 4: Evaluation & Monitoring Pipeline
Comprehensive monitoring and evaluation system:
4.1 Automated Evaluation Metrics
- 🔲 Real-time CER/WER/BLEU calculation during training
- 🔲 Arabic-specific metrics (diacritic accuracy, Islamic terminology)
- 🔲 Performance benchmarking against baseline models
- 🔲 Automated model comparison and ranking
4.2 Production Model Monitoring
- 🔲 OCR accuracy tracking in production
- 🔲 Model drift detection (performance degradation)
- 🔲 Latency and throughput monitoring
- 🔲 Cost tracking (GPU usage, API calls)
4.3 Data Quality Monitoring
- 🔲 Input image quality assessment
- 🔲 Arabic text output validation
- 🔲 Character distribution monitoring
- 🔲 Detection of adversarial or out-of-domain inputs
4.4 MLOps Monitoring Dashboard
- 🔲 Training pipeline health and status
- 🔲 Model performance trends over time
- 🔲 A/B testing results visualization
- 🔲 Automated alerting for performance issues
4.5 Continuous Evaluation & Testing
- 🔲 Automated testing pipeline with held-out datasets
- 🔲 Synthetic Arabic manuscript generation for testing
- 🔲 Human evaluation workflow integration
- 🔲 Automated retraining triggers based on performance
Potential Future Work
Model Serving (Standalone)
- MLflow model serving / MLServer integration
- Scalable inference backend with load balancing and caching
- Performance optimization
Cloud Deployment
- CloudFormation / SageMaker endpoint configuration
- Auto-scaling, monitoring, and cost optimization
- Remote MLflow tracking server with S3 artifact storage
Technical Specifications
Model Architecture
Input: Manuscript Image (PNG/JPEG)
        ↓
DeepSeek-OCR Vision Encoder
        ↓
Language Model Decoder (with LoRA adapters)
        ↓
Output: Arabic Text
Training Pipeline
mssqpi/Arabic-OCR-Dataset (2.16M samples)
        ↓
Conversation Format (User: <image> + prompt, Assistant: text)
        ↓
DeepSeekOCRDataCollator (image preprocessing + tokenization)
        ↓
LoRA Fine-tuning via Unsloth (2% of parameters)
        ↓
Push LoRA adapters to HuggingFace Hub
Inference Pipeline
Upload Image → Load Base Model + LoRA Adapters → model.infer() → Arabic Text
Evaluation Pipeline
OCR Output → Character/Word Error Rate
           → BLEU Score
           → Diacritic Accuracy
           → Islamic Term Recognition
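Diacritic accuracy has no standard off-the-shelf metric; below is a deliberately naive sketch that compares the sequence of tashkeel marks (U+064B through U+0652) once the undiacritized skeletons match. A real evaluation would align marks to their base letters:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic tashkeel range

def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

def diacritic_accuracy(reference: str, hypothesis: str) -> float:
    # Only meaningful when the base-letter skeletons already agree.
    if strip_diacritics(reference) != strip_diacritics(hypothesis):
        return 0.0
    ref_marks = DIACRITICS.findall(reference)
    hyp_marks = DIACRITICS.findall(hypothesis)
    if not ref_marks:
        return 1.0
    matches = sum(r == h for r, h in zip(ref_marks, hyp_marks))
    return matches / max(len(ref_marks), len(hyp_marks))
```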
Arabic-Specific Considerations
- Right-to-Left text direction
- Connected letterforms with contextual shapes
- Diacritics preservation for classical texts
- Islamic terminology and abbreviations
- Historical spelling variations
- Multi-column manuscript layouts
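These conventions motivate a normalization step before training and scoring. A sketch assuming the common Arabic NLP normalizations; how aggressively to normalize (and whether to ever strip diacritics) depends on how much classical orthography the evaluation should preserve:

```python
import re

def normalize_arabic(text: str, keep_diacritics: bool = True) -> str:
    text = re.sub(r"[إأآٱ]", "ا", text)  # unify alef variants
    text = text.replace("ـ", "")          # drop tatweel (kashida) elongation
    text = text.replace("ى", "ي")         # alef maqsura → yaa
    if not keep_diacritics:               # classical texts usually keep them
        text = re.sub(r"[\u064B-\u0652]", "", text)
    return text
```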
Success Metrics
Model Performance
- Character Error Rate < 5% for printed text
- Word Error Rate < 10% for classical manuscripts
- Diacritic Accuracy > 90% for vowelized text
- Processing Speed < 2 seconds per page
- Model Size < 1GB for deployment efficiency
MLOps Automation
- End-to-end automation: Code push → Auto train → Auto deploy < 1 hour
- Training cost efficiency: < $10 per training run on L4 GPU
- Deployment reliability: 99.9% uptime with auto-scaling
- Model versioning: 100% reproducible experiments
- Monitoring coverage: Real-time alerts for performance degradation
Complete MLOps Workflow
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Developer     │     │ GitHub Actions  │     │  HF Spaces GPU  │
│   Push Code     │────►│  Trigger Train  │────►│  LoRA Finetune  │
│   Update Data   │     │   Run Tests     │     │  MLflow Track   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Production    │◄────│ Model Registry  │◄────│  Auto Evaluate  │
│   Deployment    │     │  A/B Testing    │     │  Performance    │
│   Auto-scale    │     │ Version Control │     │  Gate Release   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
Current Status
Completed
- Phase 1: Introduction & Setup (Arabic text processing, MLflow integration)
- Phase 2: Training Pipeline Development (DeepSeek-OCR + LoRA fine-tuning)
- Phase 3: MLOps Automation
  - ✅ GitHub Actions workflow for automated training triggers
  - ✅ HF Spaces training + inference environment with Gradio UI + REST API (L4 GPU)
  - ✅ DeepSeekOCRDataCollator ported from notebook
  - ✅ Auto-push trained LoRA to HF Hub
  - ✅ MLflow experiment tracking (local SQLite on Space)
  - ✅ Inference tab with RTL output + /api/infer endpoint
  - ✅ Model loading: base DeepSeek-OCR + LoRA adapters from Hub
  - ✅ Repo consolidation: single repo syncs to HF Spaces via GitHub Actions
  - ✅ sync-to-hf-spaces.yml workflow for auto-deploy on push to main
Up Next
- Phase 4: Evaluation & Monitoring Pipeline
Matn - Arabic OCR for classical Islamic texts, powered by DeepSeek-OCR with LoRA fine-tuning.