Khrawsynth committed
Commit b49cb90 · verified · 1 Parent(s): be9b223

Create README.md

Files changed (1)
  1. README.md +215 -0
README.md ADDED
@@ -0,0 +1,215 @@
+ # AssameseOCR
+
+ **AssameseOCR** is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. It pairs the frozen vision encoder of Microsoft's Florence-2-large with a custom character-level decoder and reaches 94.67% character accuracy on the validation split of the Assamese subset of the Mozhi dataset.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** MWire Labs
+ - **Model type:** Vision-Language OCR
+ - **Language:** Assamese (অসমীয়া)
+ - **License:** Apache 2.0
+ - **Base Model:** microsoft/Florence-2-large-ft
+ - **Architecture:** Florence-2 Vision Encoder + Custom Transformer Decoder
+
+ ### Model Architecture
+
+ ```
+ Image (768×768)
+
+ Florence-2 Vision Encoder (frozen, ~361M params)
+
+ Vision Projection (1024 → 512 dim)
+
+ Transformer Decoder (4 layers, 8 heads)
+
+ Character-level predictions (187 vocab)
+ ```
+
+ **Key Components:**
+ - **Vision Encoder:** Florence-2-large DaViT architecture (frozen)
+ - **Decoder:** 4-layer Transformer with 512 hidden dimensions
+ - **Tokenizer:** Character-level with 187 tokens (Assamese chars + English + digits + symbols)
+ - **Total Parameters:** ~378M (~361M frozen, ~17.5M trainable)
+
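The trainable parameter figure above can be sanity-checked with plain arithmetic. The sketch below is an estimate, not the authors' code. It assumes a standard Transformer decoder layer (self-attention, cross-attention, feed-forward block, three layer norms), a feed-forward width of 2048, a learned position embedding of length 128, and an untied output head. None of these details are stated in the README, so every default in `estimate_trainable_params` other than the 1024, 512, 4, and 187 taken from the diagram is a hypothetical choice.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def attention_params(d_model):
    """Q, K, V and output projections, each d_model x d_model with bias."""
    return 4 * linear_params(d_model, d_model)


def decoder_layer_params(d_model, d_ff):
    """One decoder layer: self-attn, cross-attn, feed-forward, 3 layer norms."""
    self_attn = attention_params(d_model)
    cross_attn = attention_params(d_model)
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 3 * 2 * d_model  # scale and shift per norm
    return self_attn + cross_attn + feed_forward + layer_norms


def estimate_trainable_params(vision_dim=1024, d_model=512, n_layers=4,
                              vocab_size=187, d_ff=2048, max_len=128):
    # d_ff and max_len are assumptions; the README does not state them.
    projection = linear_params(vision_dim, d_model)
    token_embedding = vocab_size * d_model
    position_embedding = max_len * d_model
    decoder = n_layers * decoder_layer_params(d_model, d_ff)
    output_head = linear_params(d_model, vocab_size)
    return (projection + token_embedding + position_embedding
            + decoder + output_head)


total = estimate_trainable_params()
print(f"{total:,} trainable parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes out at about 17.6M, which is close to the ~17.5M stated above. The small difference could come from any of the assumed details, for example a different position encoding.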
+ ## Training Details
+
+ ### Training Data
+
+ - **Dataset:** [Mozhi Indic OCR Dataset](https://huggingface.co/datasets/darknight054/indic-mozhi-ocr) (Assamese subset)
+ - **Training samples:** 79,697 word images
+ - **Validation samples:** 9,945 word images
+ - **Test samples:** 10,146 word images
+ - **Source:** IIT Hyderabad CVIT
+
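As a quick illustration of what these split sizes imply, the snippet below computes the share of each split and the number of optimizer steps per epoch. It uses the batch size of 16 listed under Hyperparameters and assumes the final partial batch is kept rather than dropped, which the README does not specify.

```python
import math

# Split sizes as listed in the Training Data section.
splits = {"train": 79_697, "validation": 9_945, "test": 10_146}

total_images = sum(splits.values())
fractions = {name: count / total_images for name, count in splits.items()}

batch_size = 16  # from the Hyperparameters section
# Assumption: the last, smaller batch is kept (no drop_last).
steps_per_epoch = math.ceil(splits["train"] / batch_size)

for name, share in fractions.items():
    print(f"{name:>10}: {splits[name]:>6} images ({share:.1%})")
print(f"steps per epoch at batch size {batch_size}: {steps_per_epoch}")
```

This works out to roughly an 80/10/10 split across 99,788 word images.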
+ ### Training Procedure
+
+ **Hardware:**
+ - GPU: NVIDIA A40 (48GB VRAM)
+ - Training time: ~8 hours (3 epochs)
+
+ **Hyperparameters:**
+ - Epochs: 3
+ - Batch size: 16
+ - Learning rate: 3e-4
+ - Optimizer: AdamW (weight_decay=0.01)
+ - Scheduler: CosineAnnealingLR
+ - Max sequence length: 128 characters
+ - Gradient clipping: 1.0
+
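To make the scheduler setting concrete, here is a minimal sketch of the closed-form cosine annealing curve that PyTorch's `CosineAnnealingLR` follows up to its `T_max`. The README does not say whether the scheduler was stepped per batch or per epoch, what `T_max` was, or what minimum learning rate was used. The sketch assumes per-batch stepping over all 3 epochs, a minimum of 0, and the helper name `cosine_annealing_lr` is purely illustrative.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Learning rate after `step` updates on a single cosine half-period."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Assumption: one scheduler step per batch, 3 epochs, batch size 16.
total_steps = 3 * math.ceil(79_697 / 16)

for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {cosine_annealing_lr(step, total_steps):.2e}")
```

The rate starts at the configured 3e-4, passes through half that value at the midpoint, and decays to the minimum by the final step.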
+ **Training Strategy:**
+ - Froze the Florence-2 vision encoder to reuse its pretrained visual features
+ - Trained only the projection layer and the transformer decoder
+ - Updated all weights of those trainable components directly (no LoRA or other adapters)
+
+ ## Performance
+
+ ### Results
+
+ | Epoch (validation split) | Character Accuracy | Loss |
+ |--------------------------|--------------------|------|
+ | 1 | 91.61% | 0.2844 |
+ | 2 | 94.09% | 0.1548 |
+ | 3 | **94.67%** | **0.1221** |
+
+ **Character Error Rate (CER):** ~5.33% (100% − 94.67% character accuracy at epoch 3, validation split)
+
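The README does not state how character accuracy was measured; the CER above is simply its complement. For readers who want to evaluate predictions themselves, the sketch below shows the conventional edit-distance definition of CER (total Levenshtein distance divided by total reference length). This is a generic illustration, not the authors' evaluation code, and its result can differ from a per-position accuracy figure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level character error rate over paired strings."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars if chars else 0.0


print(cer(["kitten"], ["sitting"]))  # 3 edits over 6 reference characters
```

Python strings are sequences of Unicode code points, so Assamese vowel signs and other combining marks each count as one character here.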
+ ### Comparison
+
+ The model performs well for an adapted foundation model, though it trails a specialized OCR architecture:
+ - Mozhi paper (CRNN+CTC specialist): ~99% accuracy
+ - AssameseOCR (Florence-2 generalist): 94.67% accuracy
+
+ A gap of roughly 4 percentage points is expected when adapting a general vision-language model rather than training a specialized OCR architecture. In exchange, AssameseOCR offers:
+ - Extensibility to vision-language tasks (VQA, captioning, document understanding)
+ - Fewer training epochs (3, versus a typical 10-20 for CRNN)
+ - Foundation model benefits (transfer learning, robustness)
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install torch torchvision transformers pillow
+ ```
+
+ ### Inference
+
+ ```python
+ import json
+
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, CLIPImageProcessor
+
+
+ class CharTokenizer:
+     """Character-level tokenizer backed by a JSON list of tokens."""
+
+     def __init__(self, vocab):
+         self.vocab = vocab
+         self.char2id = {c: i for i, c in enumerate(vocab)}
+         self.id2char = {i: c for i, c in enumerate(vocab)}
+         self.pad_token_id = self.char2id["<pad>"]
+         self.bos_token_id = self.char2id["<s>"]
+         self.eos_token_id = self.char2id["</s>"]
+
+     def decode(self, ids, skip_special_tokens=True):
+         chars = []
+         for i in ids:
+             ch = self.id2char.get(int(i), "")
+             # Special tokens look like "<pad>"; a bare "<" is an ordinary symbol.
+             is_special = len(ch) > 1 and ch.startswith("<") and ch.endswith(">")
+             if skip_special_tokens and is_special:
+                 continue
+             chars.append(ch)
+         return "".join(chars)
+
+     @classmethod
+     def load(cls, path):
+         with open(path, "r", encoding="utf-8") as f:
+             vocab = json.load(f)
+         return cls(vocab)
+
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the Florence-2 base model (its vision encoder is reused by AssameseOCR)
+ florence_model = AutoModelForCausalLM.from_pretrained(
+     "microsoft/Florence-2-large-ft",
+     trust_remote_code=True,
+ ).to(device)
+
+ # Load the image processor
+ image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")
+
+ # Load the character-level tokenizer
+ char_tokenizer = CharTokenizer.load("assamese_char_tokenizer.json")
+
+ # Load the AssameseOCR weights.
+ # Note: the FlorenceCharOCR class must be defined as in training;
+ # its definition is not included in this README.
+ checkpoint = torch.load("assamese_ocr_best.pt", map_location=device)
+ # ocr_model.load_state_dict(checkpoint['model_state_dict'])
+
+ # Inference
+ image = Image.open("assamese_text.jpg").convert("RGB")
+ # Process and predict...
+ ```
+
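The snippet above stops at "Process and predict...", and the README does not show the prediction step. As a rough guide only, the sketch below shows how greedy autoregressive decoding typically works for a character-level decoder. `next_token_scores` is a hypothetical callable standing in for the real model (image features plus the tokens decoded so far, returning one score per vocabulary entry), and `toy_scorer` is a stand-in used purely to make the example runnable. Neither name comes from the AssameseOCR code.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_len=128):
    """Pick the highest-scoring token at each step until EOS or max_len."""
    ids = [bos_id]
    for _ in range(max_len):
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids[1:]  # drop the leading BOS token


def toy_scorer(ids, target=(3, 4, 5), eos_id=2, vocab_size=6):
    """Hypothetical stand-in for a model: spells out `target`, then EOS."""
    position = len(ids) - 1
    wanted = target[position] if position < len(target) else eos_id
    return [1.0 if i == wanted else 0.0 for i in range(vocab_size)]


predicted_ids = greedy_decode(toy_scorer, bos_id=1, eos_id=2)
print(predicted_ids)
```

With a real model, the returned ids would be passed to `char_tokenizer.decode(...)` to obtain the recognized text. The default `max_len` of 128 mirrors the maximum sequence length listed under Hyperparameters.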
+ ## Vocabulary
+
+ The character-level tokenizer includes:
+ - **Assamese characters:** 119 unique chars (consonants, vowels, diacritics, conjuncts)
+ - **English:** 52 chars (a-z, A-Z)
+ - **Digits:** 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
+ - **Symbols:** 33 chars (punctuation, special chars)
+ - **Special tokens:** 6 tokens (`<pad>`, `<s>`, `</s>`, `<unk>`, `<OCR>`, `<lang_as>`)
+ - **Total vocabulary:** 187 tokens. The per-category counts above sum to 240, so they should be read as approximate or overlapping rather than as an exact breakdown of the 187 tokens.
+
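To illustrate how a character-level vocabulary of this shape can be assembled, here is a minimal sketch. It is not the authors' tokenizer-building script. The special tokens and the three digit ranges come from the list above, while `build_vocab`, `encode`, and the short `sample` string are hypothetical and stand in for the full Assamese character inventory, which the README does not enumerate.

```python
import string

SPECIAL_TOKENS = ["<pad>", "<s>", "</s>", "<unk>", "<OCR>", "<lang_as>"]


def build_vocab(script_chars):
    """Special tokens first, then script characters, letters and digits."""
    letters = list(string.ascii_lowercase + string.ascii_uppercase)
    digits = (
        list(string.digits)                        # ASCII 0-9
        + [chr(c) for c in range(0x09E6, 0x09F0)]  # Assamese/Bengali ০-৯
        + [chr(c) for c in range(0x0966, 0x0970)]  # Devanagari ०-९
    )
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for ch in list(script_chars) + letters + digits:
        if ch not in seen:  # keep first occurrence, preserve order
            seen.add(ch)
            vocab.append(ch)
    return vocab


def encode(text, char2id):
    """Map each character to its id, falling back to <unk>."""
    unk = char2id["<unk>"]
    return [char2id.get(ch, unk) for ch in text]


sample = "অসমীয়া"  # illustrative only, not the full character set
vocab = build_vocab(sample)
char2id = {c: i for i, c in enumerate(vocab)}
print(len(vocab), encode("অসম 2026", char2id))
```

Placing the special tokens first keeps their ids stable regardless of which characters are added later. Symbols such as the space are left out of this sketch, so they map to `<unk>`.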
+ ## Limitations
+
+ - Trained only on printed text (not handwritten)
+ - Trained on word-level images from the Mozhi dataset, so it may not generalize to full-page OCR without prior line and word segmentation
+ - Sequences longer than the 128-character training limit are unlikely to decode reliably
+ - Does not handle layout analysis or reading order
+ - Performance on degraded or low-quality images has not been extensively tested
+
+ ## Future Work
+
+ - Extend to **MeiteiOCR** for Meitei Mayek script
+ - Scale to **NE-OCR** covering all 9+ Northeast Indian languages
+ - Add document layout analysis and reading order detection
+ - Improve performance with synthetic data augmentation
+ - Fine-tune for handwritten text recognition
+ - Extend to multimodal tasks (image captioning, VQA for documents)
+
+ ## Citation
+
+ If you use AssameseOCR in your research, please cite:
+
+ ```bibtex
+ @software{assameseocr2026,
+   author = {MWire Labs},
+   title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/MWirelabs/assamese-ocr}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - **Dataset:** Mozhi Indic OCR Dataset by IIT Hyderabad CVIT ([Mathew et al., 2022](https://arxiv.org/abs/2205.06740))
+ - **Base Model:** Florence-2 by Microsoft Research
+ - **Organization:** MWire Labs, Shillong, Meghalaya, India
+
+ ## Contact
+
+ - **Organization:** [MWire Labs](https://huggingface.co/MWirelabs)
+ - **Location:** Shillong, Meghalaya, India
+ - **Focus:** Language technology for Northeast Indian languages
+
+ ---
+
+ **Part of the MWire Labs NLP suite:**
+ - [KhasiBERT](https://huggingface.co/MWirelabs/KhasiBERT-110M) - Khasi language model
+ - [NE-BERT](https://huggingface.co/MWirelabs/NE-BERT) - 9 Northeast languages
+ - [Kren-M](https://huggingface.co/MWirelabs/Kren-M) - Khasi-English conversational AI
+ - **AssameseOCR** - Assamese text recognition