	---
	language:
	- asm # Assamese (ISO 639-3 code)
	license: apache-2.0
	base_model: microsoft/Florence-2-large-ft
	tags:
	- vision
	- ocr
	- assamese
	- northeast-india
	- indic-languages
	- character-recognition
	- florence-2
	- vision-language
	datasets:
	- darknight054/indic-mozhi-ocr
	metrics:
	- accuracy
	- character_error_rate
	library_name: transformers
	pipeline_tag: image-to-text

	model-index:
	- name: AssameseOCR
	  results:
	  - task:
	      type: image-to-text
	      name: Optical Character Recognition
	    dataset:
	      name: Mozhi Indic OCR (Assamese)
	      type: darknight054/indic-mozhi-ocr
	      config: assamese
	      split: test
	    metrics:
	    - type: accuracy
	      value: 94.67
	      name: Character Accuracy
	      verified: false
	    - type: character_error_rate
	      value: 5.33
	      name: Character Error Rate (CER)
	      verified: false

	---

	# AssameseOCR

	AssameseOCR is a vision-language model for Optical Character Recognition (OCR) of printed Assamese text. It pairs the frozen vision encoder of Microsoft's Florence-2-large-ft with a custom character-level Transformer decoder and reaches 94.67% character accuracy on the Assamese subset of the Mozhi dataset.

	## Model Details

	### Model Description

	- Developed by: MWire Labs
	- Model type: Vision-Language OCR
	- Language: Assamese (অসমীয়া)
	- License: Apache 2.0
	- Base Model: microsoft/Florence-2-large-ft
	- Architecture: Florence-2 Vision Encoder + Custom Transformer Decoder

	### Model Architecture

	```
	Image (768×768)
	↓
	Florence-2 Vision Encoder (frozen, ~361M params)
	↓
	Vision Projection (1024 → 512 dim)
	↓
	Transformer Decoder (4 layers, 8 heads)
	↓
	Character-level predictions (187 vocab)
	```

	Key Components:
	- Vision Encoder: Florence-2-large DaViT architecture (frozen)
	- Decoder: 4-layer Transformer with 512 hidden dimensions
	- Tokenizer: Character-level with 187 tokens (Assamese chars + English + digits + symbols)
	- Total Parameters: 378M (361M frozen, 17.5M trainable)
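The 17.5M trainable figure can be reproduced with simple arithmetic. The sketch below assumes the decoder uses the defaults of PyTorch's `nn.TransformerDecoderLayer` (feed-forward width 2048, biases on every linear layer, three LayerNorms per layer), which is what the inference code in this README instantiates. The helper names are illustrative and not part of the released code.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def attention_params(d_model: int) -> int:
    """Multi-head attention: fused Q/K/V projection plus output projection."""
    return linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    """One decoder layer: self-attention, cross-attention, feed-forward, 3 LayerNorms."""
    attention = 2 * attention_params(d_model)
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 3 * 2 * d_model  # one scale and one shift vector per LayerNorm
    return attention + feed_forward + layer_norms


# D_FF = 2048 is an assumption (the PyTorch default); the other values come from the README.
VISION_DIM, D_MODEL, D_FF, LAYERS, VOCAB = 1024, 512, 2048, 4, 187

trainable = (
    linear_params(VISION_DIM, D_MODEL)              # vision projection
    + VOCAB * D_MODEL                               # character embedding
    + LAYERS * decoder_layer_params(D_MODEL, D_FF)  # transformer decoder
    + linear_params(D_MODEL, VOCAB)                 # output layer
)
print(f"{trainable:,} trainable parameters (~{trainable / 1e6:.1f}M)")
```

Under these assumptions the script prints `17,532,603 trainable parameters (~17.5M)`, in line with the trainable count listed above.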

	## Training Details

	### Training Data

	- Dataset: [Mozhi Indic OCR Dataset](https://huggingface.co/datasets/darknight054/indic-mozhi-ocr) (Assamese subset)
	- Training samples: 79,697 word images
	- Validation samples: 9,945 word images
	- Test samples: 10,146 word images
	- Source: IIT Hyderabad CVIT

	### Training Procedure

	Hardware:
	- GPU: NVIDIA A40 (48GB VRAM)
	- Training time: ~8 hours (3 epochs)

	Hyperparameters:
	- Epochs: 3
	- Batch size: 16
	- Learning rate: 3e-4
	- Optimizer: AdamW (weight_decay=0.01)
	- Scheduler: CosineAnnealingLR
	- Max sequence length: 128 characters
	- Gradient clipping: 1.0
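To make these numbers concrete, the sketch below works out the number of optimizer steps and the learning rate at each epoch boundary. It assumes the cosine schedule is stepped once per batch across the whole run, decays to zero (the default `eta_min` of PyTorch's `CosineAnnealingLR`), and that the last partial batch of each epoch is kept with no gradient accumulation. The README does not state these details, so treat the output as illustrative.

```python
import math

TRAIN_SAMPLES = 79_697
BATCH_SIZE = 16
EPOCHS = 3
BASE_LR = 3e-4
MIN_LR = 0.0  # assumed; matches the CosineAnnealingLR default for eta_min

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # last partial batch kept (assumed)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int, t_max: int = total_steps) -> float:
    """Closed-form cosine annealing from BASE_LR down to MIN_LR over t_max steps."""
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * step / t_max))


print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
for epoch in range(EPOCHS + 1):
    step = epoch * steps_per_epoch
    print(f"after epoch {epoch}: lr = {cosine_lr(step):.2e}")
```

With per-step annealing the learning rate falls from 3e-4 to 2.25e-4 after the first epoch, 7.5e-5 after the second, and 0 at the end of training. If the scheduler was instead stepped once per epoch, the same formula applies with `t_max` equal to the number of epochs.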

	Training Strategy:
	- Froze the Florence-2 vision encoder to reuse its pretrained visual features
	- Trained only the vision projection layer and the Transformer decoder
	- Updated all weights of those trainable modules directly (no LoRA or other adapters)
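The strategy above follows a common PyTorch pattern: freeze the pretrained part, then give the optimizer only the parameters that still require gradients. The sketch below is not the authors' training script. It uses two tiny stand-in modules (`encoder` and `head`, both hypothetical) in place of Florence-2 and the OCR decoder, and reuses the learning rate, weight decay, and clipping threshold from the hyperparameter list.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a small "encoder" to freeze and a small trainable "head".
encoder = nn.Linear(8, 16)
head = nn.Linear(16, 4)

# Freeze the encoder so it receives no gradient updates.
for param in encoder.parameters():
    param.requires_grad = False

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=3e-4, weight_decay=0.01)

# One illustrative training step on random data.
inputs = torch.randn(2, 8)
targets = torch.randint(0, 4, (2,))
with torch.no_grad():  # the frozen encoder needs no autograd graph
    features = encoder(inputs)
loss = nn.functional.cross_entropy(head(features), targets)

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)  # gradient clipping at 1.0
optimizer.step()

frozen_count = sum(p.numel() for p in encoder.parameters())
trainable_count = sum(p.numel() for p in trainable)
print(f"frozen: {frozen_count}, trainable: {trainable_count}")
```

Running the encoder under `torch.no_grad()` also avoids storing its activations for backpropagation, which is where most of the memory saving from freezing comes from.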

	## Performance

	### Results

	| Epoch (split) | Character Accuracy | Loss |
	|---------------|--------------------|--------|
	| Epoch 1 (Val) | 91.61% | 0.2844 |
	| Epoch 2 (Val) | 94.09% | 0.1548 |
	| Epoch 3 (Val) | 94.67% | 0.1221 |

	Character Error Rate (CER): ~5.33% (100% minus the 94.67% character accuracy at epoch 3)
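The README does not say how character accuracy was measured, so the following is only a sketch of the usual edit-distance definition of CER: total insertions, deletions, and substitutions divided by the total number of reference characters. Under that definition, character accuracy is `1 - CER`, which matches the relation between the two numbers reported here. The function names and sample strings are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over Unicode code points (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    length = sum(len(r) for r in references)
    return edits / length


refs = ["অসম", "ভাষা"]
hyps = ["অসম", "ভাসা"]  # second word has one substituted character
cer = character_error_rate(refs, hyps)
print(f"CER = {cer:.2%}, character accuracy = {1 - cer:.2%}")
```

Note that this counts Unicode code points, so a vowel sign or other combining mark is scored as its own character, consistent with a character-level tokenizer.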

	### Comparison

	Compared with the specialist baseline reported for the Mozhi dataset:
	- Mozhi paper (CRNN+CTC specialist): ~99% accuracy
	- AssameseOCR (Florence generalist): 94.67% accuracy

	The gap of roughly 4 percentage points is expected when adapting a general vision-language model rather than training a specialized OCR architecture. In return, AssameseOCR offers:
	- Extensibility to vision-language tasks (VQA, captioning, document understanding)
	- A short training schedule (3 epochs, versus the 10-20 typical for a CRNN)
	- Foundation model benefits (transfer learning, robustness)

	## Usage

	### Installation

	```bash
	# einops and timm are imported by Florence-2's remote modeling code
	pip install torch torchvision transformers pillow huggingface_hub einops timm
	```

	### Inference

	```python
	import torch
	import torch.nn as nn
	from PIL import Image
	from transformers import AutoModelForCausalLM, CLIPImageProcessor
	from huggingface_hub import hf_hub_download
	import json

	# Character-level tokenizer: maps each character to an integer ID
	class CharTokenizer:
	    def __init__(self, vocab):
	        self.vocab = vocab
	        self.char2id = {c: i for i, c in enumerate(vocab)}
	        self.id2char = {i: c for i, c in enumerate(vocab)}
	        self.pad_token_id = self.char2id["<pad>"]
	        self.bos_token_id = self.char2id["<s>"]
	        self.eos_token_id = self.char2id["</s>"]

	    def encode(self, text, max_length=None, add_special_tokens=True):
	        ids = [self.bos_token_id] if add_special_tokens else []
	        for ch in text:
	            ids.append(self.char2id.get(ch, self.char2id["<unk>"]))
	        if add_special_tokens:
	            ids.append(self.eos_token_id)
	        if max_length:
	            # Truncate, then right-pad to exactly max_length
	            ids = ids[:max_length]
	            if len(ids) < max_length:
	                ids += [self.pad_token_id] * (max_length - len(ids))
	        return ids

	    def decode(self, ids, skip_special_tokens=True):
	        chars = []
	        for i in ids:
	            ch = self.id2char.get(i, "")
	            # Special tokens are multi-character entries such as "<pad>";
	            # the length check keeps a literal "<" character in the output.
	            if skip_special_tokens and len(ch) > 1 and ch.startswith("<"):
	                continue
	            chars.append(ch)
	        return "".join(chars)

	    @classmethod
	    def load(cls, path):
	        with open(path, "r", encoding="utf-8") as f:
	            vocab = json.load(f)
	        return cls(vocab)

	# OCR model: frozen Florence-2 vision encoder + trainable character decoder
	class FlorenceCharOCR(nn.Module):
	    def __init__(self, florence_model, vocab_size, vision_hidden_dim, decoder_hidden_dim=512, num_layers=4):
	        super().__init__()
	        self.florence_model = florence_model

	        # Freeze all Florence-2 weights
	        for param in self.florence_model.parameters():
	            param.requires_grad = False

	        self.vision_proj = nn.Linear(vision_hidden_dim, decoder_hidden_dim)
	        self.embedding = nn.Embedding(vocab_size, decoder_hidden_dim)
	        decoder_layer = nn.TransformerDecoderLayer(
	            d_model=decoder_hidden_dim,
	            nhead=8,
	            batch_first=True
	        )
	        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
	        self.fc_out = nn.Linear(decoder_hidden_dim, vocab_size)

	    def forward(self, pixel_values, tgt_ids, tgt_mask=None):
	        # Encode the image without tracking gradients
	        with torch.no_grad():
	            vision_feats = self.florence_model._encode_image(pixel_values)

	        vision_feats = self.vision_proj(vision_feats)
	        tgt_emb = self.embedding(tgt_ids)
	        # The decoder cross-attends to the projected vision features
	        decoder_out = self.decoder(tgt_emb, vision_feats, tgt_mask=tgt_mask)
	        logits = self.fc_out(decoder_out)

	        return logits

	# Load components
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Download the tokenizer and checkpoint from the Hugging Face Hub
	tokenizer_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_char_tokenizer.json")
	model_path = hf_hub_download(repo_id="MWirelabs/assamese-ocr", filename="assamese_ocr_best.pt")

	# Load tokenizer
	char_tokenizer = CharTokenizer.load(tokenizer_path)

	# Load Florence base model
	florence_model = AutoModelForCausalLM.from_pretrained(
	    "microsoft/Florence-2-large-ft",
	    trust_remote_code=True
	).to(device)

	# Load image processor
	image_processor = CLIPImageProcessor.from_pretrained("microsoft/Florence-2-large-ft")

	# Initialize OCR model
	ocr_model = FlorenceCharOCR(
	    florence_model=florence_model,
	    vocab_size=len(char_tokenizer.vocab),
	    vision_hidden_dim=1024,
	    decoder_hidden_dim=512,
	    num_layers=4
	).to(device)

	# Load trained weights
	checkpoint = torch.load(model_path, map_location=device)
	ocr_model.load_state_dict(checkpoint['model_state_dict'])
	ocr_model.eval()

	# Inference function: greedy, character-by-character decoding
	def recognize_text(image_path):
	    # Load and process image
	    image = Image.open(image_path).convert("RGB")
	    pixel_values = image_processor(images=[image], return_tensors="pt")['pixel_values'].to(device)

	    # Generate prediction
	    with torch.no_grad():
	        # Start with BOS token
	        generated_ids = [char_tokenizer.bos_token_id]

	        for _ in range(128):  # max length
	            tgt_tensor = torch.tensor([generated_ids], device=device)
	            # Note: the image is re-encoded on every step (simple, not optimized)
	            logits = ocr_model(pixel_values, tgt_tensor)

	            # Get next token
	            next_token = logits[0, -1].argmax().item()
	            generated_ids.append(next_token)

	            # Stop if EOS
	            if next_token == char_tokenizer.eos_token_id:
	                break

	    # Decode
	    text = char_tokenizer.decode(generated_ids, skip_special_tokens=True)
	    return text

	# Example usage
	result = recognize_text("assamese_text.jpg")
	print(f"Recognized text: {result}")
	```

	## Vocabulary

	The character-level tokenizer includes:
	- Assamese characters: 119 unique chars (consonants, vowels, diacritics, conjuncts)
	- English: 52 chars (a-z, A-Z)
	- Digits: 30 chars (ASCII 0-9, Assamese ০-৯, Devanagari ०-९)
	- Symbols: 33 chars (punctuation, special chars)
	- Special tokens: 6 tokens (`<pad>`, `<s>`, `</s>`, `<unk>`, `<OCR>`, `<lang_as>`)
	- Total vocabulary: 187 tokens
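One way to check this breakdown against the released tokenizer file is to bucket each vocabulary entry by Unicode properties. The sketch below runs on a small toy vocabulary and assumes the tokenizer JSON is a flat list of single-character strings plus the special tokens, which is how `CharTokenizer.load` in the Usage section reads it. The `categorize` helper and its category names are illustrative.

```python
from collections import Counter

SPECIAL_TOKENS = {"<pad>", "<s>", "</s>", "<unk>", "<OCR>", "<lang_as>"}


def categorize(token: str) -> str:
    """Assign a vocabulary entry to one of the categories listed above."""
    if token in SPECIAL_TOKENS:
        return "special"
    if token.isdigit():
        return "digit"  # ASCII, Assamese, and Devanagari digits
    if len(token) == 1 and "\u0980" <= token <= "\u09ff":
        return "assamese"  # Bengali-Assamese Unicode block
    if token.isascii() and token.isalpha():
        return "english"
    return "symbol"


# Toy vocabulary for illustration; replace with the list loaded from the tokenizer JSON.
toy_vocab = ["<pad>", "<s>", "</s>", "<unk>", "অ", "ক", "া", "a", "Z", "7", "৭", "७", ".", "<"]

counts = Counter(categorize(token) for token in toy_vocab)
for category, count in sorted(counts.items()):
    print(f"{category}: {count}")
print(f"total: {sum(counts.values())}")
```

To run it on the real vocabulary, pass `char_tokenizer.vocab` from the inference example instead of `toy_vocab`. Digits are tested before the script block so that Assamese digits land in the digit bucket and not in the Assamese character bucket.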

	## Limitations

	- Trained only on printed text (not handwritten)
	- Word-level images from Mozhi dataset (may not generalize to full-page OCR without line segmentation)
	- Decoding is capped at 128 characters (the maximum sequence length used in training), so longer text is truncated
	- Does not handle layout analysis or reading order
	- Performance on degraded/low-quality images not extensively tested

	## Future Work

	- Extend to MeiteiOCR for Meitei Mayek script
	- Scale to NE-OCR covering all 9+ Northeast Indian languages
	- Add document layout analysis and reading order detection
	- Improve performance with synthetic data augmentation
	- Fine-tune for handwritten text recognition
	- Extend to multimodal tasks (image captioning, VQA for documents)

	## Citation

	If you use AssameseOCR in your research, please cite:

	```bibtex
	@software{assameseocr2026,
	  author = {MWire Labs},
	  title = {AssameseOCR: Vision-Language Model for Assamese Text Recognition},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/MWirelabs/assamese-ocr}
	}
	```

	## Acknowledgments

	- Dataset: Mozhi Indic OCR Dataset by IIT Hyderabad CVIT ([Mathew et al., 2022](https://arxiv.org/abs/2205.06740))
	- Base Model: Florence-2 by Microsoft Research
	- Organization: MWire Labs, Shillong, Meghalaya, India

	## Contact

	- Organization: [MWire Labs](https://huggingface.co/MWirelabs)
	- Location: Shillong, Meghalaya, India
	- Focus: Language technology for Northeast Indian languages

	---

	Part of the MWire Labs NLP suite:
	- [KhasiBERT](https://huggingface.co/MWirelabs/KhasiBERT-110M) - Khasi language model
	- [NE-BERT](https://huggingface.co/MWirelabs/NE-BERT) - 9 Northeast languages
	- [Kren-M](https://huggingface.co/MWirelabs/Kren-M) - Khasi-English conversational AI
	- AssameseOCR - Assamese text recognition