---
base_model: deepseek-ai/DeepSeek-OCR-2
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- ar
tags:
- ocr
- image-text-to-text
- vision-language
- document-understanding
- json-extraction
- arabic
- deepseek_vl_v2
- unsloth
- lora
- peft
---

# deepseek_ocr2_arabic_jsonify

`deepseek_ocr2_arabic_jsonify` is a task-specific fine-tune of `deepseek-ai/DeepSeek-OCR-2` for OCR-to-JSON extraction on building regulation pages. It is trained to read a single document page image and return one strict JSON object containing the page's header fields and regulation-table fields, with no extra explanatory text.

The training workflow in the notebook loads the Unsloth-compatible `unsloth/DeepSeek-OCR-2` checkpoint, which corresponds to the same DeepSeek-OCR-2 base model family, and then fine-tunes it with LoRA adapters for structured extraction.

## Intended use

- Extract structured data from scanned or photographed building regulation pages.
- Return JSON only.
- Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
- Use empty strings for missing or unreadable fields instead of hallucinating values.
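Because the model is trained to emit JSON only, downstream code can usually call `json.loads` directly on the decoded output. A small defensive parser is still useful in case a chat wrapper adds markdown fences around the object. This is a hypothetical helper, not part of the released repository:

```python
import json

def parse_model_output(raw: str) -> dict:
    """Parse the model's raw text output into a dict.

    The model is trained to return JSON only, but this guard also
    strips a markdown code fence (```json ... ```) if one appears.
    Hypothetical helper; not shipped with the model.
    """
    text = raw.strip()
    if text.startswith("```"):
        # split on the fences and keep the inner payload
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)
```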

## Output schema

The model was trained to produce this exact JSON structure and key order:

```json
{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
```
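Since the schema and key order are fixed, predictions can be coerced into exactly this shape before storage: unknown keys are dropped and missing fields fall back to empty strings, mirroring the empty-string convention the model was trained on. A minimal sketch (hypothetical helper, not part of the repo):

```python
# Fixed key lists, in the trained key order.
HEADER_KEYS = ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"]
TABLE_KEYS = ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"]

def normalize(prediction: dict) -> dict:
    """Coerce a model prediction into the fixed schema:
    drop unknown keys, fill missing fields with empty strings."""
    return {
        "header": {k: prediction.get("header", {}).get(k, "")
                   for k in HEADER_KEYS},
        "table": {k: prediction.get("table", {}).get(k, "")
                  for k in TABLE_KEYS},
    }
```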

## Prompt format

The notebook converts each sample into a 3-message conversation:

```json
[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]
```

The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and empty-string fallback for missing fields.

## Training data

- Custom dataset of 108 document-page images paired with gold JSON extraction targets.
- Domain: Riyadh municipal building regulation pages.
- Source format: local `data.jsonl` with fields `image`, `text`, `transformed_text_to_json`, and `transformed_text_to_json_translated_to_English`.
- Training target: the `text` field, which contains the expected JSON output.

## Training details

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
- Hardware used for the recorded run: `NVIDIA A100-SXM4-40GB`
- Image settings: `image_size=1024`, `base_size=1024`, `crop_mode=True`
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- LoRA config: `r=32`, `lora_alpha=64`, `lora_dropout=0`
- Precision: `bf16` when supported
- Per-device batch size: `2`
- Gradient accumulation steps: `4`
- Effective batch size: `8`
- Learning rate: `2e-4`
- Optimizer: `adamw_8bit`
- LR scheduler: `linear`
- Epochs in the recorded run: `8`
- Actual training steps in the recorded run: `112`
- Train on responses only: `True`
- Trainable parameters: `172,615,680 / 3,561,735,040` (`4.85%`)
- Training runtime: `1548.2392` seconds (`25.8` minutes)
- Peak reserved GPU memory: `39.686 GB`
- Peak reserved GPU memory attributed to training: `29.706 GB`
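The recorded step count is consistent with the dataset size and batch settings: 108 samples at an effective batch of 8 give ceil(108 / 8) = 14 optimizer steps per epoch, and 14 × 8 epochs = 112 total steps. The arithmetic, spelled out:

```python
import math

num_samples = 108        # training samples
per_device_batch = 2     # per-device batch size
grad_accum = 4           # gradient accumulation steps
epochs = 8               # epochs in the recorded run

effective_batch = per_device_batch * grad_accum          # 2 * 4 = 8
steps_per_epoch = math.ceil(num_samples / effective_batch)  # ceil(13.5) = 14
total_steps = steps_per_epoch * epochs                   # 14 * 8 = 112
```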

## Evaluation notes

- The notebook reports baseline `DeepSeek-OCR-2` performance of `23%` character error rate on one sample before fine-tuning.
- The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
- Training loss decreased from `1.4462` at step 1 to `0.0281` at step 112.
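Character error rate (CER) is conventionally the character-level edit distance between prediction and reference, divided by the reference length. The notebook's own metric code is not reproduced in this card, so the following is only a minimal reference sketch of that convention:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(prediction, reference) / max(len(reference), 1)
```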

## Limitations

- This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
- The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
- No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
- Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.

## Repository notes

- The notebook saved the model under `AyoubChLin/deepseek_ocr2_arabic_jsonify`.
- The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.

## Acknowledgements

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning workflow: Unsloth