deepseek_ocr2_arabic_jsonify

deepseek_ocr2_arabic_jsonify is a task-specific fine-tune of deepseek-ai/DeepSeek-OCR-2 for OCR-to-JSON extraction on building regulation pages. It is trained to read a single document-page image and return one strict JSON object containing the page header fields and regulation table fields, with no extra explanatory text.

The training workflow in the notebook loads the Unsloth-compatible unsloth/DeepSeek-OCR-2 checkpoint, which maps to the same DeepSeek-OCR-2 base model family, and then fine-tunes it with LoRA for structured extraction.

Intended use

  • Extract structured data from scanned or photographed building regulation pages.
  • Return JSON only.
  • Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
  • Use empty strings for missing or unreadable fields instead of hallucinating values.

Output schema

The model was trained to produce this exact JSON structure and key order:

{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
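Downstream code can coerce model output onto this schema defensively. The sketch below is illustrative rather than part of the released code: a hypothetical `normalize` helper that parses a response, drops any unexpected keys, and falls back to empty strings for missing fields, mirroring the training convention.

```python
import json

# Expected keys, in the order the model was trained to emit them.
HEADER_KEYS = ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"]
TABLE_KEYS = ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"]

def normalize(raw: str) -> dict:
    """Parse model output and coerce it onto the fixed schema.

    Missing fields become empty strings (the training-time fallback);
    extra keys, if any, are dropped.
    """
    data = json.loads(raw)
    return {
        "header": {k: data.get("header", {}).get(k, "") for k in HEADER_KEYS},
        "table": {k: data.get("table", {}).get(k, "") for k in TABLE_KEYS},
    }

out = normalize('{"header": {"municipality": "الرياض"}, "table": {}}')
```

If `json.loads` raises (e.g. the model emitted stray text around the object), that is worth surfacing rather than silently recovering, since JSON-only output is part of the training contract.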

Prompt format

The notebook converts each sample into a 3-message conversation:

[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]

The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and empty-string fallback for missing fields.
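When preparing your own samples, the same three-message layout can be reproduced programmatically. This is a minimal sketch (the `build_conversation` helper and `SYSTEM_PROMPT` constant are illustrative names; the role tokens and `<image>` placeholder follow the template shown above):

```python
SYSTEM_PROMPT = ("Extract only the header and table fields and "
                 "return one valid JSON object.")

def build_conversation(image_path: str, gold_json: str) -> list:
    # Mirrors the 3-message layout used in the notebook: system
    # instruction, user turn carrying the page image, gold JSON answer.
    return [
        {"role": "<|System|>", "content": SYSTEM_PROMPT},
        {"role": "<|User|>", "content": "<image>\n.", "images": [image_path]},
        {"role": "<|Assistant|>", "content": gold_json},
    ]

conv = build_conversation("page_001.png", '{"header": {}, "table": {}}')
```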

Training data

  • Custom dataset of 108 document-page images paired with gold JSON extraction targets.
  • Domain: Riyadh municipal building regulation pages.
  • Source format: local data.jsonl with fields image, text, transformed_text_to_json, and transformed_text_to_json_translated_to_English.
  • Training target: the text field, which contains the expected JSON output.

Training details

  • Base model: deepseek-ai/DeepSeek-OCR-2
  • Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
  • Hardware used for the recorded run: NVIDIA A100-SXM4-40GB
  • Image settings: image_size=1024, base_size=1024, crop_mode=True
  • LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • LoRA config: r=32, lora_alpha=64, lora_dropout=0
  • Precision: bf16 when supported
  • Per-device batch size: 2
  • Gradient accumulation steps: 4
  • Effective batch size: 8
  • Learning rate: 2e-4
  • Optimizer: adamw_8bit
  • LR scheduler: linear
  • Epochs in the recorded run: 8
  • Actual training steps in the recorded run: 112
  • Train on responses only: True
  • Trainable parameters: 172,615,680 / 3,561,735,040 (4.85%)
  • Training runtime: 1548.24 seconds (~25.8 minutes)
  • Peak reserved GPU memory: 39.686 GB
  • Peak reserved GPU memory attributed to training: 29.706 GB
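The derived figures above are internally consistent; a quick arithmetic check, assuming the sample count and hyperparameters as listed:

```python
import math

# Hyperparameters quoted in the run log above.
per_device_batch, grad_accum = 2, 4
effective_batch = per_device_batch * grad_accum      # 8

# 108 training samples over 8 epochs (last partial batch still steps).
steps_per_epoch = math.ceil(108 / effective_batch)   # 14 optimizer steps
total_steps = steps_per_epoch * 8                    # 112, as recorded

# Trainable-parameter fraction for the LoRA adapter.
trainable, total = 172_615_680, 3_561_735_040
trainable_pct = round(100 * trainable / total, 2)    # 4.85
```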

Evaluation notes

  • The notebook reports baseline DeepSeek-OCR-2 performance of 23% character error rate on one sample before fine-tuning.
  • The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
  • Training loss decreased from 1.4462 at step 1 to 0.0281 at step 112.
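Character error rate (CER) is conventionally computed as Levenshtein edit distance divided by reference length; whether the notebook used exactly this formulation is not documented, but a self-contained implementation for validating on your own samples is:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # DP row for the empty-reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n] / max(m, 1)
```

For example, `cer("abcd", "abed")` is 0.25 (one substitution over four reference characters).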

Limitations

  • This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
  • The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
  • No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
  • Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.

Repository notes

  • The notebook saved the model under AyoubChLin/deepseek_ocr2_arabic_jsonify.
  • The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.

Acknowledgements

  • Base model: deepseek-ai/DeepSeek-OCR-2
  • Fine-tuning workflow: Unsloth