---
language:
- asm # Assamese ISO 639-3 code
license: apache-2.0
base_model: microsoft/Florence-2-large-ft
tags:
- vision
- ocr
- assamese
- northeast-india
- indic-languages
- character-recognition
- florence-2
- vision-language
datasets:
- darknight054/indic-mozhi-ocr
metrics:
- accuracy
- character_error_rate
library_name: transformers
pipeline_tag: image-to-text
model-index:
- name: AssameseOCR
results:
- task:
type: image-to-text
name: Optical Character Recognition
dataset:
name: Mozhi Indic OCR (Assamese)
type: darknight054/indic-mozhi-ocr
config: assamese
split: test
metrics:
- type: accuracy
value: 94.67
name: Character Accuracy
verified: false
- type: character_error_rate
value: 5.33
name: Character Error Rate (CER)
verified: false
---
# AssameseOCR
**AssameseOCR** is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. Built on Microsoft's Florence-2-large foundation model with a custom character-level decoder, it achieves 94.67% character accuracy on the Mozhi dataset.
## Model Details
### Model Description
- **Developed by:** MWire Labs
- **Model type:** Vision-Language OCR
- **Language:** Assamese (অসমীয়া)
- **License:** Apache 2.0
- **Base Model:** microsoft/Florence-2-large-ft
- **Architecture:** Florence-2 Vision Encoder + Custom Transformer Decoder
### Model Architecture
```
Image (768×768)
↓
Florence-2 Vision Encoder (frozen, 360M params)
↓
Vision Projection (1024 → 512 dim)
↓
Transformer Decoder (4 layers, 8 heads)
↓
Character-level predictions (187 vocab)
```
**Key Components:**
- **Vision Encoder:** Florence-2-large DaViT architecture (frozen)
- **Decoder:** 4-layer Transformer with 512 hidden dimensions
- **Tokenizer:** Character-level with 187 tokens (Assamese chars + English + digits + symbols)
- **Total Parameters:** 378M (361M frozen, 17.5M trainable)
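The frozen/trainable split above can be verified with a few lines of PyTorch. This is a generic sketch (the toy model below just stands in for the real encoder and decoder), counting parameters by their `requires_grad` flag:

```python
import torch.nn as nn

def count_params(model: nn.Module):
    """Split a model's parameter count into trainable and frozen."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy example: a frozen "backbone" plus a trainable "head"
backbone = nn.Linear(10, 10)
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(10, 5)
model = nn.Sequential(backbone, head)

trainable, frozen = count_params(model)
print(trainable, frozen)  # 55 trainable (10*5 + 5), 110 frozen (10*10 + 10)
```

Run against the full OCR model, this should report roughly 17.5M trainable and 361M frozen parameters.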
## Training Details
### Training Data
- **Dataset:** [Mozhi Indic OCR Dataset](https://huggingface.co/datasets/darknight054/indic-mozhi-ocr) (Assamese subset)
- **Training samples:** 79,697 word images
- **Validation samples:** 9,945 word images
- **Test samples:** 10,146 word images
- **Source:** IIT Hyderabad CVIT
### Training Procedure
**Hardware:**
- GPU: NVIDIA A40 (48GB VRAM)
- Training time: ~8 hours (3 epochs)
**Hyperparameters:**
- Epochs: 3
- Batch size: 16
- Learning rate: 3e-4
- Optimizer: AdamW (weight_decay=0.01)
- Scheduler: CosineAnnealingLR
- Max sequence length: 128 characters
- Gradient clipping: 1.0
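The hyperparameters above translate into a short PyTorch setup. This is a minimal sketch, not the original training script; the `nn.Linear` stand-in and the dummy loss are purely illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 187)  # stand-in for the trainable projection + decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
num_epochs = 3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... in the real loop: iterate batches of size 16, compute the OCR loss ...
    loss = model(torch.randn(16, 512)).mean()  # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at max norm 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # cosine annealing over the 3 epochs
```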
**Training Strategy:**
- Froze Florence-2 vision encoder (leveraging pretrained visual features)
- Trained only the projection layer and transformer decoder
- Full fine-tuning (no LoRA) for maximum quality
## Performance
### Results
| Split | Character Accuracy | Loss |
|-------|-------------------|------|
| Epoch 1 (Val) | 91.61% | 0.2844 |
| Epoch 2 (Val) | 94.09% | 0.1548 |
| Epoch 3 (Val) | **94.67%** | **0.1221** |
**Character Error Rate (CER):** ~5.33%
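Here CER is simply 100% minus character accuracy; more generally it is computed as the Levenshtein (edit) distance between the predicted and reference strings, normalized by the reference length. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(prediction, reference) / len(reference)

print(cer("অসমীয়া", "অসমীয়া"))  # 0.0 for an exact match
```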
### Comparison
The model achieves strong performance for a foundation model approach:
- Mozhi paper (CRNN+CTC specialist): ~99% accuracy
- AssameseOCR (Florence generalist): 94.67% accuracy
This gap of roughly four points is expected when adapting a general vision-language model rather than training a specialized OCR architecture. However, AssameseOCR offers:
- Extensibility to vision-language tasks (VQA, captioning, document understanding)
- Faster training (3 epochs vs typical 10-20 for CRNN)
- Foundation model benefits (transfer learning, robustness)
## Usage
### Installation
```bash
pip install torch torchvision transformers pillow
```
### Inference
```python
import json

import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPImageProcessor
from huggingface_hub import hf_hub_download


# Character-level tokenizer
class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.char2id = {c: i for i, c in enumerate(vocab)}
        self.id2char = {i: c for i, c in enumerate(vocab)}
        self.pad_token_id = self.char2id["<pad>"]
        self.bos_token_id = self.char2id["<s>"]
        self.eos_token_id = self.char2id["</s>"]

    def encode(self, text, max_length=None, add_special_tokens=True):
        ids = [self.bos_token_id] if add_special_tokens else []
        for ch in text:
            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
        if add_special_tokens:
            ids.append(self.eos_token_id)
        if max_length:
            ids = ids[:max_length]
            if len(ids) < max_length:
                ids += [self.pad_token_id] * (max_length - len(ids))
        return ids

    def decode(self, ids, skip_special_tokens=True):
        chars = []
        for i in ids:
            ch = self.id2char.get(i, "")
            if skip_special_tokens and ch.startswith("<"):
                continue
            chars.append(ch)
        return "".join(chars)

    @classmethod
    def load(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            vocab = json.load(f)
        return cls(vocab)


# Florence-2 vision encoder + character-level transformer decoder
class FlorenceCharOCR(nn.Module):
    def __init__(self, florence_model, vocab_size, vision_hidden_dim,
                 decoder_hidden_dim=512, num_layers=4):
        super().__init__()
        self.florence_model = florence_model
        for param in self.florence_model.parameters():
            param.requires_grad = False  # keep the vision encoder frozen
        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=decoder_hidden_dim,
            nhead=8,
            batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)

    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
        with torch.no_grad():
            vision_feats = self.florence_model._encode_image(pixel_values)
        vision_feats = self.vision_proj(vision_feats)
        tgt_emb = self.embedding(tgt_ids)
        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
        logits = self.fc_out(decoder_out)
        return logits


device = "cuda" if torch.cuda.is_available() else "cpu"

# Download tokenizer and checkpoint from the Hugging Face Hub
tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_char_tokenizer.json")
model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_ocr_best.pt")

# Load tokenizer
char_tokenizer = CharTokenizer.load(tokenizer_path)

# Load the Florence-2 base model and its image processor
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft",
    trust_remote_code=True,
).to(device)
image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")

# Initialize the OCR model and load the trained weights
ocr_model = FlorenceCharOCR(
    florence_model=florence_model,
    vocab_size=len(char_tokenizer.vocab),
    vision_hidden_dim=1024,
    decoder_hidden_dim=512,
    num_layers=4,
).to(device)
checkpoint = torch.load(model_path, map_location=device)
ocr_model.load_state_dict(checkpoint["model_state_dict"])
ocr_model.eval()


def recognize_text(image_path):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=[image], return_tensors="pt")["pixel_values"].to(device)

    # Greedy autoregressive decoding, starting from the BOS token
    with torch.no_grad():
        generated_ids = [char_tokenizer.bos_token_id]
        for _ in range(128):  # max sequence length
            tgt_tensor = torch.tensor([generated_ids], device=device)
            logits = ocr_model(pixel_values, tgt_tensor)
            next_token = logits[0, -1].argmax().item()
            generated_ids.append(next_token)
            if next_token == char_tokenizer.eos_token_id:  # stop at EOS
                break

    return char_tokenizer.decode(generated_ids, skip_special_tokens=True)


# Example usage
result = recognize_text("assamese_text.jpg")
print(f"Recognized text: {result}")
```
## Vocabulary
The character-level tokenizer includes:
- **Assamese characters:** 66 unique chars (consonants, vowels, diacritics, conjuncts)
- **English:** 52 chars (a-z, A-Z)
- **Digits:** 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
- **Symbols:** 33 chars (punctuation, special chars)
- **Special tokens:** 6 tokens (`<pad>`, `<s>`, `</s>`, `<unk>`, `<OCR>`, `<lang_as>`)
- **Total vocabulary:** 187 tokens
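To illustrate how the character-level mapping works, here is a toy vocabulary following the same layout (special tokens first, then characters); the real 187-token vocabulary ships as `assamese_char_tokenizer.json`:

```python
# Toy vocabulary: special tokens first, then an illustrative subset of characters
vocab = ["<pad>", "<s>", "</s>", "<unk>", "অ", "স", "ম", "১", "a"]
char2id = {c: i for i, c in enumerate(vocab)}

def encode(text):
    # BOS + per-character ids (unknown chars fall back to <unk>) + EOS
    return [char2id["<s>"]] + [char2id.get(ch, char2id["<unk>"]) for ch in text] + [char2id["</s>"]]

print(encode("অসম"))  # [1, 4, 5, 6, 2]
print(encode("অxম"))  # [1, 4, 3, 6, 2] -- 'x' is not in the vocab, so it maps to <unk>
```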
## Limitations
- Trained only on printed text (not handwritten)
- Word-level images from Mozhi dataset (may not generalize to full-page OCR without line segmentation)
- Character-level decoder may struggle with very long sequences (>128 chars)
- Does not handle layout analysis or reading order
- Performance on degraded/low-quality images not extensively tested
## Future Work
- Extend to **MeiteiOCR** for Meitei Mayek script
- Scale to **NE-OCR** covering all 9+ Northeast Indian languages
- Add document layout analysis and reading order detection
- Improve performance with synthetic data augmentation
- Fine-tune for handwritten text recognition
- Extend to multimodal tasks (image captioning, VQA for documents)
## Citation
If you use AssameseOCR in your research, please cite:
```bibtex
@software{assameseocr2026,
author = {MWire Labs},
title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/MWirelabs/assamese-ocr}
}
```
## Acknowledgments
- **Dataset:** Mozhi Indic OCR Dataset by IIT Hyderabad CVIT ([Mathew et al., 2022](https://arxiv.org/abs/2205.06740))
- **Base Model:** Florence-2 by Microsoft Research
- **Organization:** MWire Labs, Shillong, Meghalaya, India
## Contact
- **Organization:** [MWire Labs](https://huggingface.co/MWirelabs)
- **Location:** Shillong, Meghalaya, India
- **Focus:** Language technology for Northeast Indian languages
---
**Part of the MWire Labs NLP suite:**
- [KhasiBERT](https://huggingface.co/MWirelabs/KhasiBERT-110M) - Khasi language model
- [NE-BERT](https://huggingface.co/MWirelabs/NE-BERT) - 9 Northeast languages
- [Kren-M](https://huggingface.co/MWirelabs/Kren-M) - Khasi-English conversational AI
- **AssameseOCR** - Assamese text recognition