---
language:
  - asm
license: apache-2.0
base_model: microsoft/Florence-2-large-ft
tags:
  - vision
  - ocr
  - assamese
  - northeast-india
  - indic-languages
  - character-recognition
  - florence-2
  - vision-language
datasets:
  - darknight054/indic-mozhi-ocr
metrics:
  - accuracy
  - character_error_rate
library_name: transformers
pipeline_tag: image-to-text
model-index:
  - name: AssameseOCR
    results:
      - task:
          type: image-to-text
          name: Optical Character Recognition
        dataset:
          name: Mozhi Indic OCR (Assamese)
          type: darknight054/indic-mozhi-ocr
          config: assamese
          split: test
        metrics:
          - type: accuracy
            value: 94.67
            name: Character Accuracy
            verified: false
          - type: character_error_rate
            value: 5.33
            name: Character Error Rate (CER)
            verified: false
---

# AssameseOCR

AssameseOCR is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. Built on Microsoft's Florence-2-large-ft foundation model with a custom character-level decoder, it achieves 94.67% character accuracy on the Mozhi dataset.

## Model Details

### Model Description

- **Developed by:** MWire Labs
- **Model type:** Vision-language OCR
- **Language:** Assamese (অসমীয়া)
- **License:** Apache 2.0
- **Base model:** microsoft/Florence-2-large-ft
- **Architecture:** Florence-2 vision encoder + custom Transformer decoder

### Model Architecture

```
Image (768×768)
  ↓
Florence-2 Vision Encoder (frozen, 360M params)
  ↓
Vision Projection (1024 → 512 dim)
  ↓
Transformer Decoder (4 layers, 8 heads)
  ↓
Character-level predictions (187-token vocab)
```

**Key Components:**

- **Vision encoder:** Florence-2-large DaViT architecture (frozen)
- **Decoder:** 4-layer Transformer with 512 hidden dimensions
- **Tokenizer:** character-level, 187 tokens (Assamese characters + English + digits + symbols)
- **Total parameters:** 378M (361M frozen, 17.5M trainable)
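As a sanity check on the counts above, the trainable-parameter figure can be estimated from the stated dimensions alone. This sketch assumes PyTorch's default `dim_feedforward=2048` for the decoder layers (not stated in this card), so treat it as an estimate rather than ground truth:

```python
# Back-of-the-envelope estimate of the trainable parameters
# (projection + embedding + decoder + output head).
d_model = 512        # decoder hidden size
d_ff = 2048          # assumed PyTorch default feed-forward size
vocab = 187          # character vocabulary
vision_dim = 1024    # Florence-2 vision feature size
layers = 4

attn = 4 * (d_model * d_model + d_model)                  # q/k/v/out projections + biases
ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model    # two linear layers + biases
norms = 3 * 2 * d_model                                   # 3 LayerNorms per layer
per_layer = 2 * attn + ffn + norms                        # self-attn + cross-attn + FFN
decoder = layers * per_layer

proj = vision_dim * d_model + d_model                     # vision projection
emb = vocab * d_model                                     # character embedding (no bias)
head = d_model * vocab + vocab                            # output head

total = decoder + proj + emb + head
print(f"{total / 1e6:.1f}M trainable parameters")         # ~17.5M, matching the card
```

The estimate lands on ~17.5M, consistent with the figure quoted above.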

## Training Details

### Training Data

- **Dataset:** Mozhi Indic OCR Dataset (Assamese subset)
- **Training samples:** 79,697 word images
- **Validation samples:** 9,945 word images
- **Test samples:** 10,146 word images
- **Source:** CVIT, IIIT Hyderabad
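The split sizes above work out to a roughly 80/10/10 partition:

```python
# Split proportions implied by the sample counts in this card.
splits = {"train": 79_697, "val": 9_945, "test": 10_146}
total = sum(splits.values())
for name, n in splits.items():
    print(f"{name}: {n} ({n / total:.1%})")
print(f"total: {total} word images")
```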

### Training Procedure

**Hardware:**

- GPU: NVIDIA A40 (48GB VRAM)
- Training time: ~8 hours (3 epochs)

**Hyperparameters:**

- Epochs: 3
- Batch size: 16
- Learning rate: 3e-4
- Optimizer: AdamW (weight_decay=0.01)
- Scheduler: CosineAnnealingLR
- Max sequence length: 128 characters
- Gradient clipping: 1.0
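For reference, CosineAnnealingLR decays the learning rate along a half cosine. Assuming it is stepped once per epoch with `T_max=3` and `eta_min=0` (neither is stated in this card), the schedule looks like:

```python
import math

base_lr = 3e-4
t_max = 3  # assumed: anneal over the 3 training epochs


def cosine_lr(epoch, base_lr=base_lr, t_max=t_max, eta_min=0.0):
    # CosineAnnealingLR closed form:
    # lr_t = eta_min + (lr_0 - eta_min) * (1 + cos(pi * t / T_max)) / 2
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


for epoch in range(t_max + 1):
    print(f"epoch {epoch}: lr = {cosine_lr(epoch):.2e}")
```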

**Training Strategy:**

- Froze the Florence-2 vision encoder, leveraging its pretrained visual features
- Trained only the projection layer and the Transformer decoder
- Full fine-tuning of those components (no LoRA) for maximum quality

## Performance

### Results

| Epoch (split) | Character Accuracy | Loss |
|---|---|---|
| Epoch 1 (val) | 91.61% | 0.2844 |
| Epoch 2 (val) | 94.09% | 0.1548 |
| Epoch 3 (val) | 94.67% | 0.1221 |

**Character Error Rate (CER):** ~5.33%
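For reference, the CER quoted here is simply 100% minus the character accuracy. More generally, CER is the Levenshtein edit distance between prediction and reference divided by the reference length; a minimal sketch:

```python
# Minimal character error rate (CER) computation, for reference.
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / max(len(reference), 1)


print(cer("hallo", "hello"))  # one substituted character out of five -> 0.2
```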

### Comparison

The model achieves strong performance for a foundation-model approach:

- Mozhi paper (specialized CRNN+CTC): ~99% accuracy
- AssameseOCR (Florence-2 generalist): 94.67% accuracy

A gap of roughly four to five percentage points is expected when adapting a general vision-language model rather than training a specialized OCR architecture. In exchange, AssameseOCR offers:

- Extensibility to vision-language tasks (VQA, captioning, document understanding)
- Faster training (3 epochs versus the 10-20 typical for CRNN)
- Foundation-model benefits (transfer learning, robustness)

## Usage

### Installation

```bash
pip install torch torchvision transformers pillow einops timm
```

(`einops` and `timm` are required by Florence-2's remote code.)

### Inference

```python
import json

import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoModelForCausalLM, CLIPImageProcessor
from huggingface_hub import hf_hub_download


# Character-level tokenizer matching assamese_char_tokenizer.json
class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.char2id = {c: i for i, c in enumerate(vocab)}
        self.id2char = {i: c for i, c in enumerate(vocab)}
        self.pad_token_id = self.char2id["<pad>"]
        self.bos_token_id = self.char2id["<s>"]
        self.eos_token_id = self.char2id["</s>"]

    def encode(self, text, max_length=None, add_special_tokens=True):
        ids = [self.bos_token_id] if add_special_tokens else []
        for ch in text:
            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
        if add_special_tokens:
            ids.append(self.eos_token_id)
        if max_length:
            ids = ids[:max_length]
            if len(ids) < max_length:
                ids += [self.pad_token_id] * (max_length - len(ids))
        return ids

    def decode(self, ids, skip_special_tokens=True):
        chars = []
        for i in ids:
            ch = self.id2char.get(i, "")
            # Special tokens (<pad>, <s>, ...) are the only multi-character
            # vocabulary entries; ordinary characters always have length 1.
            if skip_special_tokens and len(ch) > 1:
                continue
            chars.append(ch)
        return "".join(chars)

    @classmethod
    def load(cls, path):
        with open(path, "r", encoding="utf-8") as f:
            vocab = json.load(f)
        return cls(vocab)


# Florence-2 vision encoder + character-level Transformer decoder
class FlorenceCharOCR(nn.Module):
    def __init__(self, florence_model, vocab_size, vision_hidden_dim,
                 decoder_hidden_dim=512, num_layers=4):
        super().__init__()
        self.florence_model = florence_model

        # The vision encoder stays frozen; only the projection, embedding,
        # decoder, and output head carry trained weights.
        for param in self.florence_model.parameters():
            param.requires_grad = False

        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=decoder_hidden_dim,
            nhead=8,
            batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)

    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
        with torch.no_grad():
            vision_feats = self.florence_model._encode_image(pixel_values)

        vision_feats = self.vision_proj(vision_feats)
        tgt_emb = self.embedding(tgt_ids)
        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
        return self.fc_out(decoder_out)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Download tokenizer and trained weights from the Hub
tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr",
                                 filename="assamese_char_tokenizer.json")
model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr",
                             filename="assamese_ocr_best.pt")

char_tokenizer = CharTokenizer.load(tokenizer_path)

# Load the Florence-2 base model (provides the frozen vision encoder)
florence_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large-ft",
    trust_remote_code=True,
).to(device)

image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")

# Assemble the OCR model and load the trained decoder weights
ocr_model = FlorenceCharOCR(
    florence_model=florence_model,
    vocab_size=len(char_tokenizer.vocab),
    vision_hidden_dim=1024,
    decoder_hidden_dim=512,
    num_layers=4,
).to(device)

checkpoint = torch.load(model_path, map_location=device)
ocr_model.load_state_dict(checkpoint["model_state_dict"])
ocr_model.eval()


def recognize_text(image_path, max_length=128):
    """Greedy character-by-character decoding for a single word image."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = image_processor(images=[image],
                                   return_tensors="pt")["pixel_values"].to(device)

    generated_ids = [char_tokenizer.bos_token_id]  # start with BOS
    with torch.no_grad():
        for _ in range(max_length):
            tgt_tensor = torch.tensor([generated_ids], device=device)
            # Causal mask so each position attends only to earlier ones,
            # matching the decoder's training setup
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(
                tgt_tensor.size(1)).to(device)
            logits = ocr_model(pixel_values, tgt_tensor, tgt_mask=tgt_mask)

            next_token = logits[0, -1].argmax().item()
            generated_ids.append(next_token)
            if next_token == char_tokenizer.eos_token_id:  # stop at EOS
                break

    return char_tokenizer.decode(generated_ids, skip_special_tokens=True)


# Example usage
result = recognize_text("assamese_text.jpg")
print(f"Recognized text: {result}")
```

## Vocabulary

The character-level tokenizer includes:

- **Assamese characters:** 119 unique chars (consonants, vowels, diacritics, conjuncts)
- **English:** 52 chars (a-z, A-Z)
- **Digits:** 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
- **Symbols:** 33 chars (punctuation, special chars)
- **Special tokens:** 6 tokens (`<pad>`, `<s>`, `</s>`, `<unk>`, `<OCR>`, `<lang_as>`)
- **Total vocabulary:** 187 tokens
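A toy sketch of how such a character-level vocabulary behaves, using a tiny made-up vocabulary rather than the real 187-token one:

```python
# Illustrative character-level encode/decode round trip.
# The vocabulary here is a small stand-in, not the released tokenizer file.
special = ["<pad>", "<s>", "</s>", "<unk>"]
vocab = special + list("অসমীয়া 0123456789")

char2id = {c: i for i, c in enumerate(vocab)}
id2char = {i: c for c, i in char2id.items()}


def encode(text):
    # BOS + one id per character (unknowns map to <unk>) + EOS
    return ([char2id["<s>"]]
            + [char2id.get(c, char2id["<unk>"]) for c in text]
            + [char2id["</s>"]])


def decode(ids):
    # Special tokens are the only multi-character entries, so skip them.
    return "".join(id2char[i] for i in ids if len(id2char[i]) == 1)


ids = encode("অসম 2024")
print(decode(ids))  # round-trips back to "অসম 2024"
```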

## Limitations

- Trained only on printed text (not handwritten)
- Word-level images from the Mozhi dataset (may not generalize to full-page OCR without line segmentation)
- Character-level decoder may struggle with very long sequences (>128 chars)
- Does not handle layout analysis or reading order
- Performance on degraded/low-quality images not extensively tested

## Future Work

- Extend to MeiteiOCR for the Meitei Mayek script
- Scale to NE-OCR covering all 9+ Northeast Indian languages
- Add document layout analysis and reading-order detection
- Improve performance with synthetic data augmentation
- Fine-tune for handwritten text recognition
- Extend to multimodal tasks (image captioning, VQA for documents)

## Citation

If you use AssameseOCR in your research, please cite:

```bibtex
@software{assameseocr2026,
  author = {MWire Labs},
  title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/MWirelabs/assamese-ocr}
}
```

## Acknowledgments

- **Dataset:** Mozhi Indic OCR Dataset by CVIT, IIIT Hyderabad (Mathew et al., 2022)
- **Base model:** Florence-2 by Microsoft Research
- **Organization:** MWire Labs, Shillong, Meghalaya, India

## Contact

- **Organization:** MWire Labs
- **Location:** Shillong, Meghalaya, India
- **Focus:** Language technology for Northeast Indian languages

Part of the MWire Labs NLP suite:

- **KhasiBERT** - Khasi language model
- **NE-BERT** - 9 Northeast Indian languages
- **Kren-M** - Khasi-English conversational AI
- **AssameseOCR** - Assamese text recognition