---
language:
- asm  # Assamese ISO 639-3 code
license: apache-2.0
base_model: microsoft/Florence-2-large-ft
tags:
- vision
- ocr
- assamese
- northeast-india
- indic-languages
- character-recognition
- florence-2
- vision-language
datasets:
- darknight054/indic-mozhi-ocr
metrics:
- accuracy
- character_error_rate
library_name: transformers
pipeline_tag: image-to-text

model-index:
- name: AssameseOCR
  results:
  - task:
      type: image-to-text
      name: Optical Character Recognition
    dataset:
      name: Mozhi Indic OCR (Assamese)
      type: darknight054/indic-mozhi-ocr
      config: assamese
      split: test
    metrics:
    - type: accuracy
      value: 94.67
      name: Character Accuracy
      verified: false
    - type: character_error_rate
      value: 5.33
      name: Character Error Rate (CER)
      verified: false

---

# AssameseOCR

**AssameseOCR** is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. Built on Microsoft's Florence-2-large foundation model with a custom character-level decoder, it achieves 94.67% character accuracy on the Mozhi dataset.

## Model Details

### Model Description

- **Developed by:** MWire Labs
- **Model type:** Vision-Language OCR
- **Language:** Assamese (অসমীয়া)
- **License:** Apache 2.0
- **Base Model:** microsoft/Florence-2-large-ft
- **Architecture:** Florence-2 Vision Encoder + Custom Transformer Decoder

### Model Architecture

```
Image (768×768)
        ↓
Florence-2 Vision Encoder (frozen, 360M params)
        ↓
Vision Projection (1024 → 512 dim)
        ↓
Transformer Decoder (4 layers, 8 heads)
        ↓
Character-level predictions (187 vocab)
```

**Key Components:**
- **Vision Encoder:** Florence-2-large DaViT architecture (frozen)
- **Decoder:** 4-layer Transformer with 512 hidden dimensions
- **Tokenizer:** Character-level with 187 tokens (Assamese chars + English + digits + symbols)
- **Total Parameters:** 378M (361M frozen, 17.5M trainable)
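
The frozen/trainable split above can be verified with a short, generic helper (a minimal sketch using toy layers, not tied to this repository):

```python
import torch.nn as nn

def count_params(model: nn.Module):
    """Return (trainable, frozen) parameter counts for a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy example: freeze an "encoder", leave a "decoder" trainable
encoder = nn.Linear(4, 4)   # 4*4 weights + 4 biases = 20 params
for p in encoder.parameters():
    p.requires_grad = False
decoder = nn.Linear(4, 2)   # 4*2 weights + 2 biases = 10 params
model = nn.Sequential(encoder, decoder)

print(count_params(model))  # → (10, 20)
```

Running the same helper on the full OCR model should report roughly 17.5M trainable against 361M frozen parameters.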

## Training Details

### Training Data

- **Dataset:** [Mozhi Indic OCR Dataset](https://huggingface.co/datasets/darknight054/indic-mozhi-ocr) (Assamese subset)
- **Training samples:** 79,697 word images
- **Validation samples:** 9,945 word images
- **Test samples:** 10,146 word images
- **Source:** IIT Hyderabad CVIT

### Training Procedure

**Hardware:**
- GPU: NVIDIA A40 (48GB VRAM)
- Training time: ~8 hours (3 epochs)

**Hyperparameters:**
- Epochs: 3
- Batch size: 16
- Learning rate: 3e-4
- Optimizer: AdamW (weight_decay=0.01)
- Scheduler: CosineAnnealingLR
- Max sequence length: 128 characters
- Gradient clipping: 1.0
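
The hyperparameters above map onto a standard PyTorch recipe. A minimal sketch with a dummy stand-in model (the `T_max` value here is illustrative; in practice it would span the full schedule of optimizer steps):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 187)  # stand-in for the trainable decoder head

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3)

for _ in range(3):  # one scheduler step per "epoch"
    loss = model(torch.randn(16, 512)).sum()
    loss.backward()
    # Clip gradient norm to 1.0 before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # cosine-annealed below the 3e-4 base LR
```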

**Training Strategy:**
- Froze Florence-2 vision encoder (leveraging pretrained visual features)
- Trained only the projection layer and transformer decoder
- Full fine-tuning (no LoRA) for maximum quality

## Performance

### Results

| Split | Character Accuracy | Loss |
|-------|-------------------|------|
| Epoch 1 (Val) | 91.61% | 0.2844 |
| Epoch 2 (Val) | 94.09% | 0.1548 |
| Epoch 3 (Val) | **94.67%** | **0.1221** |

**Character Error Rate (CER):** ~5.33%
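
CER is the Levenshtein (edit) distance between prediction and reference, divided by the reference length. A self-contained implementation for reproducing the metric on your own outputs (a standard formulation; not necessarily the exact script used to produce the number above):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("abc", "abd"))  # → 0.3333333333333333 (one substitution in three chars)
```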

### Comparison

The model achieves strong performance for a foundation model approach:
- Mozhi paper (CRNN+CTC specialist): ~99% accuracy
- AssameseOCR (Florence generalist): 94.67% accuracy

The roughly four-point gap is expected when adapting a general vision-language model rather than training a specialized OCR architecture. However, AssameseOCR offers:
- Extensibility to vision-language tasks (VQA, captioning, document understanding)
- Faster training (3 epochs vs typical 10-20 for CRNN)
- Foundation model benefits (transfer learning, robustness)

## Usage

### Installation

```bash
pip install torch torchvision transformers pillow
```

### Inference

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPImageProcessor
from huggingface_hub import hf_hub_download
import json

# CharTokenizer class
class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.char2id = {c: i for i, c in enumerate(vocab)}
        self.id2char = {i: c for i, c in enumerate(vocab)}
        self.pad_token_id = self.char2id["<pad>"]
        self.bos_token_id = self.char2id["<s>"]
        self.eos_token_id = self.char2id["</s>"]
        
    def encode(self, text, max_length=None, add_special_tokens=True):
        ids = [self.bos_token_id] if add_special_tokens else []
        for ch in text:
            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
        if add_special_tokens:
            ids.append(self.eos_token_id)
        if max_length:
            ids = ids[:max_length]
            if len(ids) < max_length:
                ids += [self.pad_token_id] * (max_length - len(ids))
        return ids
        
    def decode(self, ids, skip_special_tokens=True):
        chars = []
        for i in ids:
            ch = self.id2char.get(i, "")
            # Special tokens are multi-char ("<pad>", "<s>", ...); checking
            # length avoids dropping a literal "<" symbol from the vocab.
            if skip_special_tokens and len(ch) > 1:
                continue
            chars.append(ch)
        return "".join(chars)
    
    @classmethod
    def load(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            vocab = json.load(f)
        return cls(vocab)

# FlorenceCharOCR model class
class FlorenceCharOCR(nn.Module):
    def __init__(self, florence_model, vocab_size, vision_hidden_dim, decoder_hidden_dim=512, num_layers=4):
        super().__init__()
        self.florence_model = florence_model
        
        for param in self.florence_model.parameters():
            param.requires_grad = False
        
        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=decoder_hidden_dim, 
            nhead=8, 
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)
        
    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
        with torch.no_grad():
            vision_feats = self.florence_model._encode_image(pixel_values)
        
        vision_feats = self.vision_proj(vision_feats)
        tgt_emb = self.embedding(tgt_ids)
        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
        logits = self.fc_out(decoder_out)
        
        return logits

# Load components
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download files from HuggingFace
tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_char_tokenizer.json")
model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_ocr_best.pt")

# Load tokenizer
char_tokenizer = CharTokenizer.load(tokenizer_path)

# Load Florence base model
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft",
    trust_remote_code=True
).to(device)

# Load image processor
image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")

# Initialize OCR model
ocr_model = FlorenceCharOCR(
    florence_model=florence_model,
    vocab_size=len(char_tokenizer.vocab),
    vision_hidden_dim=1024,
    decoder_hidden_dim=512,
    num_layers=4
).to(device)

# Load trained weights
checkpoint = torch.load(model_path, map_location=device)
ocr_model.load_state_dict(checkpoint['model_state_dict'])
ocr_model.eval()

# Inference function
def recognize_text(image_path):
    # Load and process image
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=[image], return_tensors="pt")['pixel_values'].to(device)
    
    # Generate prediction
    with torch.no_grad():
        # Start with BOS token
        generated_ids = [char_tokenizer.bos_token_id]
        
        for _ in range(128):  # max length
            tgt_tensor = torch.tensor([generated_ids], device=device)
            # Causal mask so earlier positions cannot attend to later ones,
            # matching standard autoregressive decoding
            causal_mask = nn.Transformer.generate_square_subsequent_mask(
                tgt_tensor.size(1)
            ).to(device)
            logits = ocr_model(pixel_values, tgt_tensor, tgt_mask=causal_mask)
            
            # Get next token
            next_token = logits[0, -1].argmax().item()
            generated_ids.append(next_token)
            
            # Stop if EOS
            if next_token == char_tokenizer.eos_token_id:
                break
    
    # Decode
    text = char_tokenizer.decode(generated_ids, skip_special_tokens=True)
    return text

# Example usage
result = recognize_text("assamese_text.jpg")
print(f"Recognized text: {result}")
```

## Vocabulary

The character-level tokenizer includes:
- **Assamese characters:** 119 unique chars (consonants, vowels, diacritics, conjuncts)
- **English:** 52 chars (a-z, A-Z)
- **Digits:** 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
- **Symbols:** 33 chars (punctuation, special chars)
- **Special tokens:** 6 tokens (`<pad>`, `<s>`, `</s>`, `<unk>`, `<OCR>`, `<lang_as>`)
- **Total vocabulary:** 187 tokens
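
Assamese is written in the Bengali Unicode block, which is why Assamese and Bengali digits appear together above. A quick stdlib-only way to check which block a character in the vocabulary belongs to:

```python
import unicodedata

# Assamese letters/digits resolve to BENGALI names; Devanagari and ASCII differ
for ch in ["অ", "০", "०", "9"]:
    print(repr(ch), unicodedata.name(ch))
# 'অ' → BENGALI LETTER A
# '০' → BENGALI DIGIT ZERO
# '०' → DEVANAGARI DIGIT ZERO
# '9' → DIGIT NINE
```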

## Limitations

- Trained only on printed text (not handwritten)
- Word-level images from Mozhi dataset (may not generalize to full-page OCR without line segmentation)
- Character-level decoder may struggle with very long sequences (>128 chars)
- Does not handle layout analysis or reading order
- Performance on degraded/low-quality images not extensively tested

## Future Work

- Extend to **MeiteiOCR** for Meitei Mayek script
- Scale to **NE-OCR**, covering 9+ Northeast Indian languages
- Add document layout analysis and reading order detection
- Improve performance with synthetic data augmentation
- Fine-tune for handwritten text recognition
- Extend to multimodal tasks (image captioning, VQA for documents)

## Citation

If you use AssameseOCR in your research, please cite:

```bibtex
@software{assameseocr2026,
  author = {MWire Labs},
  title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/MWirelabs/assamese-ocr}
}
```

## Acknowledgments

- **Dataset:** Mozhi Indic OCR Dataset by IIT Hyderabad CVIT ([Mathew et al., 2022](https://arxiv.org/abs/2205.06740))
- **Base Model:** Florence-2 by Microsoft Research
- **Organization:** MWire Labs, Shillong, Meghalaya, India

## Contact

- **Organization:** [MWire Labs](https://huggingface.co/MWirelabs)
- **Location:** Shillong, Meghalaya, India
- **Focus:** Language technology for Northeast Indian languages

---

**Part of the MWire Labs NLP suite:**
- [KhasiBERT](https://huggingface.co/MWirelabs/KhasiBERT-110M) - Khasi language model
- [NE-BERT](https://huggingface.co/MWirelabs/NE-BERT) - 9 Northeast languages
- [Kren-M](https://huggingface.co/MWirelabs/Kren-M) - Khasi-English conversational AI
- **AssameseOCR** - Assamese text recognition