---
base_model: deepseek-ai/DeepSeek-OCR-2
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- ar
tags:
- ocr
- image-text-to-text
- vision-language
- document-understanding
- json-extraction
- arabic
- deepseek_vl_v2
- unsloth
- lora
- peft
---

# deepseek_ocr2_arabic_jsonify

`deepseek_ocr2_arabic_jsonify` is a task-specific fine-tune of `deepseek-ai/DeepSeek-OCR-2` for OCR-to-JSON extraction on building regulation pages. It reads a single document-page image and returns one strict JSON object containing the page-header and regulation-table fields, with no extra explanatory text.

The training workflow in the notebook loads the Unsloth-compatible `unsloth/DeepSeek-OCR-2` checkpoint, which maps to the same DeepSeek OCR 2 base model family, and fine-tunes it with LoRA for structured extraction.

## Intended use

- Extract structured data from scanned or photographed building regulation pages.
- Return JSON only.
- Preserve the original document language and values exactly when possible, especially Arabic text, numbers, punctuation, and line breaks.
- Use empty strings for missing or unreadable fields instead of hallucinating values.

## Output schema

The model was trained to produce this exact JSON structure and key order:

```json
{
  "header": {
    "municipality": "",
    "district_name": "",
    "plan_number": "",
    "plot_number": "",
    "block_number": "",
    "division_area": ""
  },
  "table": {
    "building_regulations": "",
    "building_usage": "",
    "setback": "",
    "heights": "",
    "building_factor": "",
    "building_ratio": "",
    "parking_requirements": "",
    "notes": ""
  }
}
```
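
Downstream code should not assume the model's reply is already valid JSON. A minimal defensive sketch (the `conform` helper is illustrative, not from the notebook) that parses a response and coerces it into the exact schema above, with empty strings for anything missing:

```python
import json

# Expected sections and key order, matching the schema the model was trained on.
SCHEMA = {
    "header": ["municipality", "district_name", "plan_number",
               "plot_number", "block_number", "division_area"],
    "table": ["building_regulations", "building_usage", "setback",
              "heights", "building_factor", "building_ratio",
              "parking_requirements", "notes"],
}

def conform(raw: str) -> dict:
    """Parse a model response and force it into the exact schema.

    Unknown keys are dropped; missing or unparseable fields become "".
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    out = {}
    for section, keys in SCHEMA.items():
        sec = data.get(section) if isinstance(data, dict) else None
        if not isinstance(sec, dict):
            sec = {}
        out[section] = {key: str(sec.get(key) or "") for key in keys}
    return out
```

This keeps consumers of the model's output working even when a response is truncated or wrapped in stray text.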

## Prompt format

The notebook converts each sample into a 3-message conversation:

```json
[
  {
    "role": "<|System|>",
    "content": "Extract only the header and table fields and return one valid JSON object."
  },
  {
    "role": "<|User|>",
    "content": "<image>\n.",
    "images": ["document-page-image"]
  },
  {
    "role": "<|Assistant|>",
    "content": "{...gold JSON...}"
  }
]
```

The system instruction also enforces JSON-only output, original-language preservation, no extra keys, and an empty-string fallback for missing fields.
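
The mapping from one (image, gold JSON) pair to this layout can be sketched as a small helper; the function name and constant are illustrative, not taken from the notebook:

```python
SYSTEM_PROMPT = ("Extract only the header and table fields "
                 "and return one valid JSON object.")

def to_conversation(image, gold_json):
    """Wrap one (image, gold JSON) training pair in the 3-message format."""
    return [
        {"role": "<|System|>", "content": SYSTEM_PROMPT},
        {"role": "<|User|>", "content": "<image>\n.", "images": [image]},
        {"role": "<|Assistant|>", "content": gold_json},
    ]
```

At inference time the same structure is used without the assistant turn.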

## Training data

- Custom dataset of 108 document-page images paired with gold JSON extraction targets.
- Domain: Riyadh municipal building regulation pages.
- Source format: a local `data.jsonl` with fields `image`, `text`, `transformed_text_to_json`, and `transformed_text_to_json_translated_to_English`.
- Training target: the `text` field, which contains the expected JSON output.
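
Reading such a file amounts to one `json.loads` per line; a minimal sketch (the path and function name are illustrative, only the field names come from the list above):

```python
import json
from pathlib import Path

def load_records(path="data.jsonl"):
    """Yield (image, gold JSON) pairs; `text` holds the expected output."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            record = json.loads(line)
            yield record["image"], record["text"]
```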

## Training details

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning framework: Unsloth with Hugging Face Transformers/TRL
- Hardware used for the recorded run: `NVIDIA A100-SXM4-40GB`
- Image settings: `image_size=1024`, `base_size=1024`, `crop_mode=True`
- LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- LoRA config: `r=32`, `lora_alpha=64`, `lora_dropout=0`
- Precision: `bf16` when supported
- Per-device batch size: `2`
- Gradient accumulation steps: `4`
- Effective batch size: `8`
- Learning rate: `2e-4`
- Optimizer: `adamw_8bit`
- LR scheduler: `linear`
- Epochs in the recorded run: `8`
- Actual training steps in the recorded run: `112`
- Train on responses only: `True`
- Trainable parameters: `172,615,680 / 3,561,735,040` (`4.85%`)
- Training runtime: `1548.2392` seconds (about `25.8` minutes)
- Peak reserved GPU memory: `39.686 GB`
- Peak reserved GPU memory attributed to training: `29.706 GB`
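
The derived figures in the list above follow directly from the raw numbers:

```python
# Effective batch size = per-device batch size x gradient accumulation steps.
effective_batch = 2 * 4          # 8 sequences per optimizer step

# Share of parameters updated by the LoRA adapters.
trainable, total = 172_615_680, 3_561_735_040
trainable_pct = 100 * trainable / total   # about 4.85%
```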

## Evaluation notes

- The notebook reports a baseline character error rate of `23%` for `DeepSeek-OCR-2` on a single sample before fine-tuning.
- The recorded notebook run does not include a held-out validation or test benchmark after fine-tuning.
- Training loss decreased from `1.4462` at step 1 to `0.0281` at step 112.
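
The notebook excerpt does not define how the character error rate was computed; a common definition is Levenshtein edit distance normalized by the reference length, sketched here in pure Python (illustrative, not the notebook's metric code):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Dynamic-programming Levenshtein distance, computed one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, 1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (ref_ch != hyp_ch)))  # substitution
        prev = curr
    return prev[-1] / len(reference)
```

Under this definition, a CER of `23%` means roughly one edit per four reference characters.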

## Limitations

- This model is specialized for building regulation pages and may not transfer well to other document layouts or jurisdictions.
- The model is optimized for a fixed JSON schema, not general-purpose OCR or document QA.
- No separate evaluation split is documented in the notebook, so real-world accuracy should be validated on your own samples before deployment.
- Errors are more likely on low-quality scans, heavily rotated pages, partially cropped pages, handwriting, or unseen form variants.

## Repository notes

- The notebook saved the model under `AyoubChLin/deepseek_ocr2_arabic_jsonify`.
- The Hub repository currently contains both adapter artifacts and merged model weights produced by the notebook save workflow.

## Acknowledgements

- Base model: `deepseek-ai/DeepSeek-OCR-2`
- Fine-tuning workflow: Unsloth