---
base_model: deepseek-ai/DeepSeek-OCR-2
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- ar
tags:
- ocr
- image-text-to-text
- vision-language
- document-understanding
- json-extraction
- arabic
- deepseek_vl_v2
- unsloth
- lora
- peft
---

# deepseek_ocr2_arabic_jsonify

`deepseek_ocr2_arabic_jsonify` is a task-specific fine-tune of `deepseek-ai/DeepSeek-OCR-2` for OCR-to-JSON extraction on building regulation pages. It is trained to read a single document-page image and return one strict JSON object containing the page header fields and regulation table fields, with no extra explanatory text.


The training workflow in the notebook loads the Unsloth-compatible `unsloth/DeepSeek-OCR-2` checkpoint, which maps to the same DeepSeek OCR 2 base model family, and then fine-tunes it with LoRA for structured extraction.
|
|
## Intended use


- Extract structured data from scanned or photographed building regulation pages.
- Return JSON only.
- Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
- Use empty strings for missing or unreadable fields instead of hallucinating values.


## Output schema


The model was trained to produce this exact JSON structure and key order:
|
|
```json
{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
```
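In practice the model may occasionally emit malformed or incomplete JSON. A minimal post-processing sketch (the `normalize_output` helper name and fallback policy are assumptions, following the empty-string convention this card describes) that coerces raw model output onto the fixed schema:

```python
import json

# Expected key layout, copied from the schema above.
SCHEMA = {
    "header": ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"],
    "table": ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"],
}


def normalize_output(raw: str) -> dict:
    """Parse model output and coerce it onto the fixed schema.

    Missing or unparseable fields fall back to empty strings,
    and keys outside the schema are dropped.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    result = {}
    for section, keys in SCHEMA.items():
        block = parsed.get(section, {}) if isinstance(parsed, dict) else {}
        if not isinstance(block, dict):
            block = {}
        result[section] = {k: str(block.get(k, "")) for k in keys}
    return result
```

This keeps downstream consumers insulated from generation hiccups: they always receive every key, in schema order, with string values.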
|
|
## Prompt format


The notebook converts each sample into a 3-message conversation:


```json
[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]
```


The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and empty-string fallback for missing fields.
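The same layout can be reproduced when preparing your own samples. A hypothetical helper (the function name and argument handling are assumptions, not code from the notebook) that assembles the conversation, appending the gold assistant turn only for training samples:

```python
from typing import Optional

SYSTEM_PROMPT = (
    "Extract only the header and table fields and return one valid JSON object."
)


def build_conversation(image, gold_json: Optional[str] = None) -> list:
    """Assemble the 3-message DeepSeek-style conversation for one page.

    For training, pass the gold JSON target; at inference time omit it
    and let the model generate the assistant turn.
    """
    messages = [
        {"role": "<|System|>", "content": SYSTEM_PROMPT},
        # The "<image>\n." user content mirrors the prompt format above.
        {"role": "<|User|>", "content": "<image>\n.", "images": [image]},
    ]
    if gold_json is not None:
        messages.append({"role": "<|Assistant|>", "content": gold_json})
    return messages
```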
|
|
## Training data


- Custom dataset of 108 document-page images paired with gold JSON extraction targets.
- Domain: Riyadh municipal building regulation pages.
- Source format: local `data.jsonl` with fields `image`, `text`, `transformed_text_to_json`, and `transformed_text_to_json_translated_to_English`.
- Training target: the `text` field, which contains the expected JSON output.
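A loading sketch for that file (`load_samples` is a hypothetical name; confirm against your copy of the dataset which field actually holds the JSON target):

```python
import json
from pathlib import Path


def load_samples(jsonl_path):
    """Yield (image, target_json) pairs from the local data.jsonl.

    Each non-empty line is one record; `text` is read as the training
    target, matching the model card's description.
    """
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            record = json.loads(line)
            yield record["image"], record["text"]
```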
|
|
## Training details


- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
- Hardware used for the recorded run: `NVIDIA A100-SXM4-40GB`
- Image settings: `image_size=1024`, `base_size=1024`, `crop_mode=True`
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- LoRA config: `r=32`, `lora_alpha=64`, `lora_dropout=0`
- Precision: `bf16` when supported
- Per-device batch size: `2`
- Gradient accumulation steps: `4`
- Effective batch size: `8`
- Learning rate: `2e-4`
- Optimizer: `adamw_8bit`
- LR scheduler: `linear`
- Epochs in the recorded run: `8`
- Actual training steps in the recorded run: `112`
- Train on responses only: `True`
- Trainable parameters: `172,615,680 / 3,561,735,040` (`4.85%`)
- Training runtime: `1548.2392` seconds (`25.8` minutes)
- Peak reserved GPU memory: `39.686 GB`
- Peak reserved GPU memory attributed to training: `29.706 GB`
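The headline numbers above are internally consistent; a quick sanity check using only values from the list (plus the 108-sample dataset size from the training data section):

```python
import math

# Values copied from the recorded run above.
per_device_batch = 2
grad_accum = 4
trainable_params = 172_615_680
total_params = 3_561_735_040
epochs = 8
dataset_size = 108

effective_batch = per_device_batch * grad_accum            # 2 * 4 = 8
trainable_pct = round(100 * trainable_params / total_params, 2)  # 4.85
steps_per_epoch = math.ceil(dataset_size / effective_batch)      # ceil(108/8) = 14
total_steps = steps_per_epoch * epochs                           # 14 * 8 = 112

print(effective_batch, trainable_pct, total_steps)  # 8 4.85 112
```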
|
|
## Evaluation notes


- The notebook reports a baseline `DeepSeek-OCR-2` character error rate of `23%` on a single sample before fine-tuning.
- The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
- Training loss decreased from `1.4462` at step 1 to `0.0281` at step 112.
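For spot-checking on your own pages, character error rate can be computed as Levenshtein distance divided by reference length. A self-contained sketch (a standard implementation, not the notebook's metric code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    if m == 0:
        return float(n > 0)
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m
```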
|
|
## Limitations


- This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
- The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
- No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
- Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.
|
|
## Repository notes


- The notebook saved the model under `AyoubChLin/deepseek_ocr2_arabic_jsonify`.
- The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.


## Acknowledgements


- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning workflow: Unsloth
|
|