OussamaBenSlama committed · Commit e36f209 (verified) · 1 parent: e20f45c

Update Readme file

Files changed (1):
  1. README.md +82 −22
README.md CHANGED
@@ -1,22 +1,82 @@
- ---
- base_model: unsloth/qwen2.5-vl-3b-instruct-bnb-4bit
- tags:
- - text-generation-inference
- - transformers
- - unsloth
- - qwen2_5_vl
- - trl
- license: apache-2.0
- language:
- - en
- ---
-
- # Uploaded model
-
- - **Developed by:** OussamaBenSlama
- - **License:** apache-2.0
- - **Finetuned from model:** unsloth/qwen2.5-vl-3b-instruct-bnb-4bit
-
- This qwen2_5_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
-
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+ # Alef-OCR-Image2Html
+
+ An Arabic OCR model that transforms document images, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.
+
+ ## Key Features
+
+ - **Semantic HTML Output:** Generates structured HTML with semantic tags (`section`, `header`, `main`, `footer`, `table`, etc.)
+ - **Multi-format Support:** Handles various document types, including historical manuscripts, newspaper articles, scientific papers, and invoices
+ - **Arabic-Optimized:** Fine-tuned specifically for Arabic text recognition and structure extraction
+ - **Zero-cost Training:** Developed using Kaggle's free-tier computational resources
+
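A minimal inference sketch in Python. This is a hypothetical usage example, not a documented API for this checkpoint: the model ID points at this repository, but the instruction text and the processor pipeline (standard Qwen2.5-VL chat format in `transformers`) are assumptions.

```python
# Hypothetical inference sketch for an image-to-HTML request.
# MODEL_ID is this repo; the instruction wording is an assumption,
# not the documented training prompt.
MODEL_ID = "OussamaBenSlama/Alef-OCR-Image2Html"

def build_messages(image_path: str) -> list:
    """One user turn in Qwen2.5-VL chat format: an image plus a text instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Convert this Arabic document into semantic HTML."},
        ],
    }]

# To run generation (requires `torch` and a recent `transformers`; sketch only):
#
#   from PIL import Image
#   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
#
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
#   processor = AutoProcessor.from_pretrained(MODEL_ID)
#   messages = build_messages("page.png")
#   text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
#   inputs = processor(text=[text], images=[Image.open("page.png")], return_tensors="pt").to(model.device)
#   html = processor.batch_decode(model.generate(**inputs, max_new_tokens=2048), skip_special_tokens=True)[0]
```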
12
+ ## Model Architecture
13
+
14
+ - **Base Model:** Qwen2.5-VL-Instruct
15
+ - **Fine-tuning Method:** QLoRA with 4-bit quantization
16
+ - **LoRA Configuration:** Rank 16 applied to all modules
17
+ - **Optimization:** Unsloth for memory efficiency and training speed
18
+
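As a rough illustration of why rank-16 QLoRA is cheap: a LoRA adapter on a weight matrix of shape `(d_out, d_in)` trains only `r * (d_in + d_out)` parameters instead of the full `d_in * d_out`. The dimensions below are illustrative, not this model's actual projection shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    # LoRA factorizes the weight update as B @ A,
    # with A of shape (r, d_in) and B of shape (d_out, r).
    return r * d_in + d_out * r

# Illustrative 2048x2048 projection: full fine-tuning would update
# 2048 * 2048 = 4,194,304 weights; the rank-16 adapter trains 65,536.
full_params = 2048 * 2048
adapter_params = lora_params(2048, 2048)  # 65536
```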
+ ## Training Data
+
+ The model was trained on a custom dataset of **28K image-HTML pairs**, consisting of:
+
+ - **46% web-scraped content** (~13K samples): Arabic Wikipedia articles with cleaned semantic HTML
+ - **54% synthetic data** (~15K samples): Generated documents mimicking ~13 real-world formats with diverse layouts and styles
+
+ For more details, see the [arabic-image2html dataset](https://huggingface.co/datasets/OussamaBenSlama/arabic-image2html).
+
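"Cleaned semantic HTML" can be approximated by a tag whitelist that keeps semantic structure and drops presentational markup. This is a hypothetical sketch using only the standard library; the dataset's actual cleaning pipeline is not documented here:

```python
from html.parser import HTMLParser

# Assumed whitelist of semantic tags worth keeping; everything else
# (div, span, attributes, styles) is stripped.
SEMANTIC_TAGS = {"section", "header", "main", "footer", "table", "tr",
                 "td", "th", "p", "h1", "h2", "h3", "ul", "ol", "li"}

class SemanticCleaner(HTMLParser):
    """Re-emits only whitelisted tags (without attributes); text is always kept."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_TAGS:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SEMANTIC_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.out.append(data.strip())

def clean(html: str) -> str:
    parser = SemanticCleaner()
    parser.feed(html)
    return "".join(parser.out)
```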
+ ## Training Procedure
+
+ Training was performed in two stages:
+
+ **Stage 1:**
+ - Data: 40% of the training dataset
+ - Learning rate: 5e-5
+ - LR scheduler: linear
+
+ **Stage 2:**
+ - Data: 30% of the training dataset (a different split)
+ - Learning rate: 1e-5
+ - LR scheduler: cosine
+
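The two schedules above can be sketched as plain decay curves. This is a simplification: warmup steps and the exact optimizer settings are not documented here, so the functions below only illustrate the linear-vs-cosine shape at the stated peak learning rates:

```python
import math

def linear_lr(step: int, total_steps: int, peak: float = 5e-5) -> float:
    """Stage 1 sketch: linear decay from the peak LR (5e-5) to zero."""
    return peak * (1 - step / total_steps)

def cosine_lr(step: int, total_steps: int, peak: float = 1e-5) -> float:
    """Stage 2 sketch: cosine decay from the peak LR (1e-5) to zero."""
    return 0.5 * peak * (1 + math.cos(math.pi * step / total_steps))
```

Cosine decay holds the learning rate near its peak for longer before tailing off, which is a common choice for a gentler second-stage refinement pass.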
+ ## Performance
+
+ Evaluated by the NAMAA community on an anonymous benchmark dataset:
+
+ | Model | WER | CER | BLEU |
+ |-------|-----|-----|------|
+ | Alef-OCR-Image2Html | 0.92 | **0.72** | **0.19** |
+ | Qari-OCR-v0.3 (baseline) | **0.84** | 0.73 | 0.17 |
+
+ **Key Results:**
+ - Lower (better) Character Error Rate: 0.72 vs. 0.73
+ - Higher (better) BLEU score: 0.19 vs. 0.17
+ - Higher Word Error Rate (0.92 vs. 0.84), attributed to limited diacritics handling in the training data
+
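For reference, WER and CER are edit-distance ratios: Levenshtein distance over words or characters, divided by the reference length. A minimal pure-Python version (the benchmark's exact tokenization and normalization are unknown, so this only illustrates the metric):

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance over any sequence (a string or a list of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edits / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)
```

Because Arabic diacritics count as characters and also change word identity, a missed diacritic inflates WER far more than CER, which matches the error pattern reported above.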
+ ## Related Resources
+
+ - **Dataset:** [arabic-image2html](https://huggingface.co/datasets/OussamaBenSlama/arabic-image2html)
+ - **Training and Inference Notebooks:** [Available in the repository](https://github.com/OussamaBenSlama/Alef-OCR-Image2Html)
+
+ ## Citation
+
+ ```bibtex
+ @misc{alef_ocr_image2html_2025,
+   title={Alef-OCR-Image2Html: Arabic OCR to Semantic HTML},
+   author={Oussama Ben Slama},
+   year={2025},
+   howpublished={Hugging Face Models},
+   url={https://huggingface.co/OussamaBenSlama/Alef-OCR-Image2Html}
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work builds upon:
+ - The NAMAA community's state-of-the-art Qari-OCR model
+
+ ## License
+
+ Apache 2.0