---
language:
- asm
license: apache-2.0
base_model: microsoft/Florence-2-large-ft
tags:
- vision
- ocr
- assamese
- northeast-india
- indic-languages
- character-recognition
- florence-2
- vision-language
datasets:
- darknight054/indic-mozhi-ocr
metrics:
- accuracy
- character_error_rate
library_name: transformers
pipeline_tag: image-to-text
model-index:
- name: AssameseOCR
  results:
  - task:
      type: image-to-text
      name: Optical Character Recognition
    dataset:
      name: Mozhi Indic OCR (Assamese)
      type: darknight054/indic-mozhi-ocr
      config: assamese
      split: validation
    metrics:
    - type: accuracy
      value: 94.67
      name: Character Accuracy
      verified: false
    - type: character_error_rate
      value: 5.33
      name: Character Error Rate (CER)
      verified: false
---
AssameseOCR
AssameseOCR is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. It pairs the frozen vision encoder of Microsoft's Florence-2-large-ft with a custom character-level decoder and reaches 94.67% character accuracy on the validation split of the Mozhi Indic OCR dataset (Assamese subset).
Model Details
Model Description
- Developed by: MWire Labs
- Model type: Vision-Language OCR
- Language: Assamese (অসমীয়া)
- License: Apache 2.0
- Base Model: microsoft/Florence-2-large-ft
- Architecture: Florence-2 Vision Encoder + Custom Transformer Decoder
Model Architecture
Image (768×768)
↓
Florence-2 Vision Encoder (frozen, 360M params)
↓
Vision Projection (1024 → 512 dim)
↓
Transformer Decoder (4 layers, 8 heads)
↓
Character-level predictions (187 vocab)
Key Components:
- Vision Encoder: Florence-2-large DaViT architecture (frozen)
- Decoder: 4-layer Transformer with 512 hidden dimensions
- Tokenizer: Character-level with 187 tokens (Assamese chars + English + digits + symbols)
- Total Parameters: 378M (361M frozen, 17.5M trainable)
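The 17.5M trainable figure can be reproduced by hand from the layer sizes listed above. The sketch below is a back-of-the-envelope count, not the authors' own accounting. It assumes the decoder keeps PyTorch's default feed-forward width of 2048 for `nn.TransformerDecoderLayer` (the card does not state this value) and has no final LayerNorm after the stack; all helper names are illustrative.

```python
# Back-of-the-envelope count of the trainable parameters (projection + decoder).
# Assumptions (not stated in the card): feed-forward width 2048, the PyTorch
# default for nn.TransformerDecoderLayer, and no final LayerNorm after the stack.

def linear_params(in_dim: int, out_dim: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return in_dim * out_dim + out_dim


def attention_params(d_model: int) -> int:
    """Q, K, V input projections plus the output projection, each with bias."""
    return 3 * linear_params(d_model, d_model) + linear_params(d_model, d_model)


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    """Self-attention, cross-attention, feed-forward block and three LayerNorms."""
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 3 * 2 * d_model  # scale and shift per LayerNorm
    return 2 * attention_params(d_model) + feed_forward + layer_norms


def trainable_params(vocab=187, vision_dim=1024, d_model=512, d_ff=2048, layers=4) -> int:
    """Sum of vision projection, character embedding, decoder stack and output head."""
    projection = linear_params(vision_dim, d_model)
    embedding = vocab * d_model
    decoder = layers * decoder_layer_params(d_model, d_ff)
    output_head = linear_params(d_model, vocab)
    return projection + embedding + decoder + output_head


total = trainable_params()
print(f"{total:,} trainable parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes to about 17.5M, which agrees with the trainable figure quoted above; almost all of it sits in the four decoder layers.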
Training Details
Training Data
- Dataset: Mozhi Indic OCR Dataset (Assamese subset)
- Training samples: 79,697 word images
- Validation samples: 9,945 word images
- Test samples: 10,146 word images
- Source: IIT Hyderabad CVIT
Training Procedure
Hardware:
- GPU: NVIDIA A40 (48GB VRAM)
- Training time: ~8 hours (3 epochs)
Hyperparameters:
- Epochs: 3
- Batch size: 16
- Learning rate: 3e-4
- Optimizer: AdamW (weight_decay=0.01)
- Scheduler: CosineAnnealingLR
- Max sequence length: 128 characters
- Gradient clipping: 1.0
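To make the schedule concrete, the following sketch evaluates the closed-form cosine annealing curve at the epoch boundaries. Several details are assumptions, since the card only names the scheduler: the learning rate is stepped once per batch, the annealing period spans all three epochs, the minimum learning rate is 0, and the last partial batch of each epoch is kept. The function names are illustrative.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)


def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Closed-form cosine annealing from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


per_epoch = steps_per_epoch(79_697, 16)  # training set size and batch size from the card
total = 3 * per_epoch                    # 3 epochs
for step in (0, per_epoch, 2 * per_epoch, total):
    print(f"step {step:>6}: lr = {cosine_lr(step, total):.2e}")
```

If the scheduler was instead stepped once per epoch, the curve has the same shape but only three distinct values.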
Training Strategy:
- Froze Florence-2 vision encoder (leveraging pretrained visual features)
- Trained only the projection layer and transformer decoder
- Updated all weights of these two components directly (no LoRA or other adapters)
Performance
Results
| Epoch (split) | Character Accuracy | Loss |
|---|---|---|
| Epoch 1 (Val) | 91.61% | 0.2844 |
| Epoch 2 (Val) | 94.09% | 0.1548 |
| Epoch 3 (Val) | 94.67% | 0.1221 |
Character Error Rate (CER): ~5.33%
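The card does not say how CER was computed; the reported value is the complement of the character accuracy (100 − 94.67). For reference, the conventional definition divides the total edit distance by the total reference length. The sketch below implements that conventional definition, counting Unicode code points as characters; it is an illustration with made-up sample strings, not the evaluation script behind the numbers above.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over Unicode code points (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses) -> float:
    """Total edit distance divided by total reference length across the corpus."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


refs = ["অসম", "ভাষা"]
hyps = ["অসম", "ভাসা"]  # one substituted character in the second word
cer = character_error_rate(refs, hyps)
print(f"CER = {cer:.2%}, character accuracy = {1 - cer:.2%}")
```

Because insertions count as errors, an edit-distance CER can exceed 100% on badly over-generated output, so it is not always exactly one minus a per-position accuracy.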
Comparison
The model achieves strong performance for a foundation model approach:
- Mozhi paper (CRNN+CTC specialist): ~99% accuracy
- AssameseOCR (Florence generalist): 94.67% accuracy
The gap of roughly 4 percentage points is expected when adapting a general vision-language model rather than training a specialized OCR architecture. In exchange, AssameseOCR offers:
- A vision-language backbone that can be extended to related tasks (VQA, captioning, document understanding)
- Fewer training epochs (3, versus a typical 10-20 for CRNN)
- Foundation model benefits (transfer learning, robustness)
Usage
Installation
pip install torch torchvision transformers pillow huggingface_hub einops timm
Inference
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPImageProcessor
from huggingface_hub import hf_hub_download
import json
# Character-level tokenizer: one vocabulary entry per character plus special tokens
class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.char2id = {c: i for i, c in enumerate(vocab)}
        self.id2char = {i: c for i, c in enumerate(vocab)}
        self.pad_token_id = self.char2id["<pad>"]
        self.bos_token_id = self.char2id["<s>"]
        self.eos_token_id = self.char2id["</s>"]

    def encode(self, text, max_length=None, add_special_tokens=True):
        ids = [self.bos_token_id] if add_special_tokens else []
        for ch in text:
            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
        if add_special_tokens:
            ids.append(self.eos_token_id)
        if max_length:
            # Truncate, then right-pad to exactly max_length
            ids = ids[:max_length]
            if len(ids) < max_length:
                ids += [self.pad_token_id] * (max_length - len(ids))
        return ids

    def decode(self, ids, skip_special_tokens=True):
        chars = []
        for i in ids:
            ch = self.id2char.get(i, "")
            # Special tokens are the multi-character "<...>" entries; a literal "<" is kept
            if skip_special_tokens and len(ch) > 1 and ch.startswith("<") and ch.endswith(">"):
                continue
            chars.append(ch)
        return "".join(chars)

    @classmethod
    def load(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            vocab = json.load(f)
        return cls(vocab)
# OCR model: frozen Florence-2 vision encoder + trainable projection and character decoder
class FlorenceCharOCR(nn.Module):
    def __init__(self, florence_model, vocab_size, vision_hidden_dim, decoder_hidden_dim=512, num_layers=4):
        super().__init__()
        self.florence_model = florence_model
        # Freeze every Florence-2 parameter; only the layers below are trained
        for param in self.florence_model.parameters():
            param.requires_grad = False
        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=decoder_hidden_dim,
            nhead=8,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)

    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
        # Image features come from the frozen encoder, so no gradients are needed here
        with torch.no_grad():
            vision_feats = self.florence_model._encode_image(pixel_values)
        vision_feats = self.vision_proj(vision_feats)
        tgt_emb = self.embedding(tgt_ids)
        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
        logits = self.fc_out(decoder_out)
        return logits
# Load components
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download files from HuggingFace
tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_char_tokenizer.json")
model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_ocr_best.pt")
# Load tokenizer
char_tokenizer = CharTokenizer.load(tokenizer_path)
# Load Florence base model
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft",
    trust_remote_code=True
).to(device)
# Load image processor
image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")
# Initialize OCR model
ocr_model = FlorenceCharOCR(
    florence_model=florence_model,
    vocab_size=len(char_tokenizer.vocab),
    vision_hidden_dim=1024,
    decoder_hidden_dim=512,
    num_layers=4
).to(device)
# Load trained weights
checkpoint = torch.load(model_path, map_location=device)
ocr_model.load_state_dict(checkpoint['model_state_dict'])
ocr_model.eval()
# Inference function
def recognize_text(image_path):
    # Load and process image
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=[image], return_tensors="pt")['pixel_values'].to(device)
    # Greedy decoding, one character per step
    with torch.no_grad():
        # Start with BOS token
        generated_ids = [char_tokenizer.bos_token_id]
        for _ in range(128):  # max length
            tgt_tensor = torch.tensor([generated_ids], device=device)
            logits = ocr_model(pixel_values, tgt_tensor)
            # Get next token
            next_token = logits[0, -1].argmax().item()
            generated_ids.append(next_token)
            # Stop if EOS
            if next_token == char_tokenizer.eos_token_id:
                break
    # Decode
    text = char_tokenizer.decode(generated_ids, skip_special_tokens=True)
    return text

# Example usage
result = recognize_text("assamese_text.jpg")
print(f"Recognized text: {result}")
Vocabulary
The character-level tokenizer includes:
- Assamese characters: 119 unique chars (consonants, vowels, diacritics, conjuncts)
- English: 52 chars (a-z, A-Z)
- Digits: 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
- Symbols: 33 chars (punctuation, special chars)
- Special tokens: 6 tokens (<pad>, <s>, </s>, <unk>, <OCR>, <lang_as>)
- Total vocabulary: 187 tokens
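One way to inspect how a vocabulary file splits across the categories listed above is to bucket each entry by its Unicode properties. The sketch below is illustrative only: the category boundaries are assumptions (the card does not define them), `categorize` and `vocabulary_breakdown` are hypothetical helper names, and the sample vocabulary is a toy list rather than the released tokenizer.

```python
from collections import Counter
import unicodedata


def categorize(token: str) -> str:
    """Assign a vocabulary entry to one coarse category (illustrative scheme)."""
    if len(token) > 1:
        # Multi-character entries such as <pad> or <lang_as> are treated as special tokens.
        return "special"
    if unicodedata.category(token) == "Nd":
        return "digit"      # ASCII, Assamese and Devanagari decimal digits
    if 0x0980 <= ord(token) <= 0x09FF:
        return "assamese"   # Bengali-Assamese Unicode block
    if token.isascii() and token.isalpha():
        return "english"
    return "symbol"


def vocabulary_breakdown(vocab) -> Counter:
    """Count how many vocabulary entries fall into each category."""
    return Counter(categorize(token) for token in vocab)


sample_vocab = ["<pad>", "<s>", "ক", "া", "a", "Z", "5", "৫", "५", ".", "<"]
print(vocabulary_breakdown(sample_vocab))
```

Running `vocabulary_breakdown(char_tokenizer.vocab)` on the tokenizer loaded in the Inference section would give the per-category counts for the released vocabulary. Note that this scheme labels any multi-character entry as special, so conjuncts stored as a single entry would need their own rule.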
Limitations
- Trained only on printed text (not handwritten)
- Word-level images from Mozhi dataset (may not generalize to full-page OCR without line segmentation)
- Output is capped at 128 characters (the maximum sequence length used in training and in the inference loop), so longer text must be split before recognition
- Does not handle layout analysis or reading order
- Performance on degraded/low-quality images not extensively tested
Future Work
- Extend to MeiteiOCR for Meitei Mayek script
- Scale to NE-OCR covering all 9+ Northeast Indian languages
- Add document layout analysis and reading order detection
- Improve performance with synthetic data augmentation
- Fine-tune for handwritten text recognition
- Extend to multimodal tasks (image captioning, VQA for documents)
Citation
If you use AssameseOCR in your research, please cite:
@software{assameseocr2026,
author = {MWire Labs},
title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/MWirelabs/assamese-ocr}
}
Acknowledgments
- Dataset: Mozhi Indic OCR Dataset by IIT Hyderabad CVIT (Mathew et al., 2022)
- Base Model: Florence-2 by Microsoft Research
- Organization: MWire Labs, Shillong, Meghalaya, India
Contact
- Organization: MWire Labs
- Location: Shillong, Meghalaya, India
- Focus: Language technology for Northeast Indian languages
Part of the MWire Labs NLP suite: