---
base_model: deepseek-ai/DeepSeek-OCR-2
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- ar
tags:
- ocr
- image-text-to-text
- vision-language
- document-understanding
- json-extraction
- arabic
- deepseek_vl_v2
- unsloth
- lora
- peft
---

# deepseek_ocr2_arabic_jsonify

`deepseek_ocr2_arabic_jsonify` is a task-specific fine-tune of `deepseek-ai/DeepSeek-OCR-2` for OCR-to-JSON extraction on building regulation pages. It reads a single document-page image and returns one strict JSON object containing the page-header and regulation-table fields, with no extra explanatory text.

The training workflow in the notebook loads the Unsloth-compatible `unsloth/DeepSeek-OCR-2` checkpoint, which maps to the same DeepSeek OCR 2 base model family, and fine-tunes it with LoRA for structured extraction.

## Intended use

- Extract structured data from scanned or photographed building regulation pages.
- Return JSON only.
- Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
- Use empty strings for missing or unreadable fields instead of hallucinating values.

## Output schema

The model was trained to produce this exact JSON structure and key order:

```json
{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
```
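
Downstream code should not assume the model's reply is already valid JSON. A minimal defensive sketch (the `conform` helper is illustrative, not from the notebook) that parses a response and coerces it into the exact schema above, with empty strings for anything missing:

```python
import json

# Expected sections and key order, matching the schema the model was trained on.
SCHEMA = {
    "header": ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"],
    "table": ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"],
}

def conform(raw: str) -> dict:
    """Parse a model response and force it into the exact schema.

    Unknown keys are dropped; missing or unparseable fields become "".
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    out = {}
    for section, keys in SCHEMA.items():
        sec = data.get(section) if isinstance(data, dict) else None
        if not isinstance(sec, dict):
            sec = {}
        out[section] = {key: str(sec.get(key) or "") for key in keys}
    return out
```

This keeps consumers of the model's output working even when a response is truncated or wrapped in stray text.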

## Prompt format

The notebook converts each sample into a 3-message conversation:

```json
[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]
```

The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and an empty-string fallback for missing fields.
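
The mapping from one (image, gold JSON) pair to this layout can be sketched as a small helper; the function name and constant are illustrative, not taken from the notebook:

```python
SYSTEM_PROMPT = ("Extract only the header and table fields "
                 "and return one valid JSON object.")

def to_conversation(image, gold_json):
    """Wrap one (image, gold JSON) training pair in the 3-message format."""
    return [
        {"role": "<|System|>", "content": SYSTEM_PROMPT},
        {"role": "<|User|>", "content": "<image>\n.", "images": [image]},
        {"role": "<|Assistant|>", "content": gold_json},
    ]
```

At inference time the same structure is used without the assistant turn.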

## Training data

- Custom dataset of 108 document-page images paired with gold JSON extraction targets.
- Domain: Riyadh municipal building regulation pages.
- Source format: a local `data.jsonl` with fields `image`, `text`, `transformed_text_to_json`, and `transformed_text_to_json_translated_to_English`.
- Training target: the `text` field, which contains the expected JSON output.
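
Reading such a file amounts to one `json.loads` per line; a minimal sketch (the path and function name are illustrative, only the field names come from the list above):

```python
import json
from pathlib import Path

def load_records(path="data.jsonl"):
    """Yield (image, gold JSON) pairs; `text` holds the expected output."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            record = json.loads(line)
            yield record["image"], record["text"]
```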

## Training details

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
- Hardware used for the recorded run: `NVIDIA A100-SXM4-40GB`
- Image settings: `image_size=1024`, `base_size=1024`, `crop_mode=True`
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- LoRA config: `r=32`, `lora_alpha=64`, `lora_dropout=0`
- Precision: `bf16` when supported
- Per-device batch size: `2`
- Gradient accumulation steps: `4`
- Effective batch size: `8`
- Learning rate: `2e-4`
- Optimizer: `adamw_8bit`
- LR scheduler: `linear`
- Epochs in the recorded run: `8`
- Actual training steps in the recorded run: `112`
- Train on responses only: `True`
- Trainable parameters: `172,615,680 / 3,561,735,040` (`4.85%`)
- Training runtime: `1548.2392` seconds (about `25.8` minutes)
- Peak reserved GPU memory: `39.686 GB`
- Peak reserved GPU memory attributed to training: `29.706 GB`
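
The derived figures in the list above follow directly from the raw numbers:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
effective_batch = 2 * 4          # 8 sequences per optimizer step

# Share of parameters updated by the LoRA adapters.
trainable, total = 172_615_680, 3_561_735_040
trainable_pct = 100 * trainable / total   # about 4.85%
```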

## Evaluation notes

- The notebook reports a baseline character error rate of `23%` for `DeepSeek-OCR-2` on a single sample before fine-tuning.
- The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
- Training loss decreased from `1.4462` at step 1 to `0.0281` at step 112.
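
The notebook excerpt does not define how the character error rate was computed; a common definition is Levenshtein edit distance normalized by the reference length, sketched here in pure Python (illustrative, not the notebook's metric code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Dynamic-programming Levenshtein distance, computed one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, 1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (ref_ch != hyp_ch)))  # substitution
        prev = curr
    return prev[-1] / len(reference)
```

Under this definition, a CER of `23%` means roughly one edit per four reference characters.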

## Limitations

- This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
- The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
- No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
- Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.

## Repository notes

- The notebook saved the model under `AyoubChLin/deepseek_ocr2_arabic_jsonify`.
- The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.

## Acknowledgements

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning workflow: Unsloth