AyoubChLin committed (verified) · Commit 8bdad59 · 1 Parent(s): b5a5627

Update README.md

Files changed (1): README.md (+127 −12)
---
base_model: deepseek-ai/DeepSeek-OCR-2
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- ar
tags:
- ocr
- image-text-to-text
- vision-language
- document-understanding
- json-extraction
- arabic
- deepseek_vl_v2
- unsloth
- lora
- peft
---

# deepseek_ocr2_arabic_jsonify

`deepseek_ocr2_arabic_jsonify` is a task-specific fine-tune of `deepseek-ai/DeepSeek-OCR-2` for OCR-to-JSON extraction on building regulation pages. It is trained to read a single document page image and return one strict JSON object containing the page header fields and regulation table fields, with no extra explanation text.

The training workflow in the notebook loads the Unsloth-compatible `unsloth/DeepSeek-OCR-2` checkpoint, which maps to the same DeepSeek-OCR-2 base model family, then fine-tunes it with LoRA for structured extraction.

## Intended use

- Extract structured data from scanned or photographed building regulation pages.
- Return JSON only, with no surrounding text.
- Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
- Use empty strings for missing or unreadable fields instead of hallucinating values.

## Output schema

The model was trained to produce this exact JSON structure and key order:

```json
{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
```

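Downstream code usually needs to confirm that a generated string really matches this schema. A minimal validation sketch in plain Python (the key lists mirror the schema above; the helper name `validate_extraction` is ours, not part of the released code):

```python
import json

# Expected key order, mirroring the schema above.
HEADER_KEYS = ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"]
TABLE_KEYS = ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"]

def validate_extraction(raw: str) -> dict:
    """Parse model output and enforce the header/table schema.

    Missing fields fall back to empty strings, matching the training
    convention; extra keys are dropped; key order is restored.
    """
    obj = json.loads(raw)
    return {
        "header": {k: str(obj.get("header", {}).get(k, "")) for k in HEADER_KEYS},
        "table": {k: str(obj.get("table", {}).get(k, "")) for k in TABLE_KEYS},
    }
```

Raising on `json.JSONDecodeError` (rather than silently repairing) is deliberate here: a non-JSON completion is a signal to re-run or flag the page.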
## Prompt format

The notebook converts each sample into a 3-message conversation:

```json
[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]
```

The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and empty-string fallback for missing fields.

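One way to reproduce this conversation shape from a dataset record is sketched below. The `image` and `text` fields match those listed under Training data; the helper name `to_conversation` and the abbreviated system prompt are illustrative, not the notebook's exact code:

```python
SYSTEM_PROMPT = ("Extract only the header and table fields "
                 "and return one valid JSON object.")

def to_conversation(record: dict) -> list[dict]:
    """Map one dataset record to the 3-message training conversation.

    `record` is assumed to carry `image` (a page image reference)
    and `text` (the gold JSON string) as described under Training data.
    """
    return [
        {"role": "<|System|>", "content": SYSTEM_PROMPT},
        # The user turn holds only the image placeholder plus a period.
        {"role": "<|User|>", "content": "<image>\n.",
         "images": [record["image"]]},
        # The assistant turn is the supervised target: the gold JSON.
        {"role": "<|Assistant|>", "content": record["text"]},
    ]
```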
## Training data

- Custom dataset of 108 document-page images paired with gold JSON extraction targets.
- Domain: Riyadh municipal building regulation pages.
- Source format: local `data.jsonl` with fields `image`, `text`, `transformed_text_to_json`, and `transformed_text_to_json_translated_to_English`.
- Training target: the `text` field, which contains the expected JSON output.

## Training details

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
- Hardware used for the recorded run: `NVIDIA A100-SXM4-40GB`
- Image settings: `image_size=1024`, `base_size=1024`, `crop_mode=True`
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- LoRA config: `r=32`, `lora_alpha=64`, `lora_dropout=0`
- Precision: `bf16` when supported
- Per-device batch size: `2`
- Gradient accumulation steps: `4`
- Effective batch size: `8`
- Learning rate: `2e-4`
- Optimizer: `adamw_8bit`
- LR scheduler: `linear`
- Epochs in the recorded run: `8`
- Actual training steps in the recorded run: `112`
- Train on responses only: `True`
- Trainable parameters: `172,615,680 / 3,561,735,040` (`4.85%`)
- Training runtime: `1548.2392` seconds (`25.8` minutes)
- Peak reserved GPU memory: `39.686 GB`
- Peak reserved GPU memory attributed to training: `29.706 GB`

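The reported numbers are internally consistent, which can be checked with a few lines of arithmetic (assuming the per-epoch step count is rounded up, i.e. a partial final batch still counts as a step):

```python
import math

# Effective batch size = per-device batch * gradient accumulation.
per_device_batch, grad_accum = 2, 4
effective_batch = per_device_batch * grad_accum   # 2 * 4 = 8

# Steps: ceil(108 / 8) = 14 optimizer steps per epoch, times 8 epochs.
samples, epochs = 108, 8
steps = math.ceil(samples / effective_batch) * epochs  # 14 * 8 = 112

# Trainable fraction of the 3.56B-parameter model under LoRA.
trainable, total = 172_615_680, 3_561_735_040
trainable_pct = round(100 * trainable / total, 2)  # 4.85
```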
## Evaluation notes

- The notebook reports a baseline character error rate of `23%` on one sample for the base `DeepSeek-OCR-2` model before fine-tuning.
- The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
- Training loss decreased from `1.4462` at step 1 to `0.0281` at step 112.

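Character error rate is conventionally the character-level edit (Levenshtein) distance divided by the reference length. A minimal pure-Python sketch of that metric, for validating outputs on your own samples (this is an assumption about the metric's definition, not the notebook's exact evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```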
## Limitations

- This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
- The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
- No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
- Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.

## Repository notes

- The notebook saved the model under `AyoubChLin/deepseek_ocr2_arabic_jsonify`.
- The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.

## Acknowledgements

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning workflow: [Unsloth](https://github.com/unslothai/unsloth)