Namsel BUDA CNN: Tibetan Character Classifier

Lightweight CNN for Tibetan character recognition, used in TradutorBUDA. It replaces the hand-crafted feature pipeline of the Namsel OCR system with learned CNN features.

94.47% validation accuracy on 1,020 Tibetan character classes. Trained on 47,777 samples derived from the Namsel OCR project dataset. Exported to ONNX (FP32 and INT8) for CPU deployment.

Model Files

| File | Size | Description |
|---|---|---|
| best_model.pth | 4.30 MB | PyTorch checkpoint (recommended) |
| best_model.onnx | 1.35 MB | ONNX FP32 export |
| best_model_int8.onnx | 0.37 MB | ONNX INT8 quantized (fastest on CPU) |
| label_mapping.json | - | Class index ↔ label mapping (1,020 classes) |
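As a rough sanity check on the ONNX sizes, the raw weight storage can be estimated from the parameter count given in the Architecture section. This is only a sketch: it assumes the sizes above are binary megabytes (MiB) and ignores graph metadata, quantization scales, and any tensors left unquantized, so the real files are expected to be slightly larger. The helper name `weights_mib` is made up for this illustration.

```python
# Rough size estimate: parameter count x bytes per weight, reported in MiB.
PARAMS = 353_596  # parameter count stated in the Architecture section


def weights_mib(num_params: int, bytes_per_weight: int) -> float:
    """Raw weight storage in MiB, ignoring graph metadata and quantization extras."""
    return num_params * bytes_per_weight / (1024 ** 2)


fp32_mib = weights_mib(PARAMS, 4)  # FP32 stores 4 bytes per weight
int8_mib = weights_mib(PARAMS, 1)  # INT8 stores 1 byte per weight

print(f"FP32 weights: {fp32_mib:.2f} MiB")
print(f"INT8 weights: {int8_mib:.2f} MiB")
```

The FP32 estimate lines up with the 1.35 MB export, and the INT8 estimate sits a little under the 0.37 MB file, which is plausible once per-tensor overhead is added.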

Architecture: TibetanCNN

CPU-optimized CNN with depthwise separable convolutions and residual connections.

Input (1 × 32 × 32 grayscale)
→ Stem:  Conv2d(1→32, 3×3) + BN + ReLU          [32 × 32 × 32]
→ Down1: DepthwiseSeparableConv(32→64, stride=2)  [64 × 16 × 16]
→ Res1:  ResidualBlock(64)                        [64 × 16 × 16]
→ Down2: DepthwiseSeparableConv(64→128, stride=2) [128 × 8 × 8]
→ Res2:  ResidualBlock(128)                       [128 × 8 × 8]
→ Down3: DepthwiseSeparableConv(128→256,stride=2) [256 × 4 × 4]
→ GlobalAveragePool → Dropout(0.3) → FC(256→1020)
  • Parameters: 353,596 (~8–18× fewer than the same layers built with standard convolutions)
  • Inference: < 5 ms per character on CPU
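To show where the parameter savings come from, the sketch below compares weight counts for a standard 3×3 convolution and a depthwise separable one at the three downsampling widths listed above. It assumes a 3×3 depthwise kernel followed by a 1×1 pointwise convolution and ignores biases and BatchNorm parameters; the actual blocks in model.py may differ. The function names are illustrative only.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (no bias)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Channel widths of Down1, Down2, Down3 from the architecture listing.
DOWN_LAYERS = [(32, 64), (64, 128), (128, 256)]

ratios = []
for c_in, c_out in DOWN_LAYERS:
    std = standard_conv_params(c_in, c_out)
    dws = depthwise_separable_params(c_in, c_out)
    ratios.append(std / dws)
    print(f"{c_in}->{c_out}: standard={std}, separable={dws}, ratio={std / dws:.1f}x")
```

Under these assumptions each downsampling layer comes out roughly 8 to 9 times smaller, which matches the lower end of the range quoted above.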

Results

| Model | Val Accuracy | Parameters | Training Time |
|---|---|---|---|
| TibetanCNN (this model) | 94.47% | 353,596 | 10.1 min |
| CNN+Transformer (comparison) | 95.52% | 491,708 | 18.2 min |

The CNN+Transformer hybrid gains 1.05 percentage points of validation accuracy but uses 39% more parameters and takes 80% longer to train. For CPU OCR deployment, the lightweight CNN with ONNX/INT8 export is the practical choice.
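The trade-off figures can be reproduced directly from the results table. The snippet below is plain arithmetic on the reported numbers; the dictionary layout and variable names are just for illustration.

```python
# Values copied from the results table.
cnn = {"acc": 94.47, "params": 353_596, "train_min": 10.1}
hybrid = {"acc": 95.52, "params": 491_708, "train_min": 18.2}

# Accuracy difference in percentage points (absolute, not relative).
acc_gain_points = hybrid["acc"] - cnn["acc"]

# Relative overheads of the hybrid model, in percent.
extra_params_pct = (hybrid["params"] / cnn["params"] - 1) * 100
extra_time_pct = (hybrid["train_min"] / cnn["train_min"] - 1) * 100

print(f"accuracy gain: {acc_gain_points:.2f} points")
print(f"extra parameters: {extra_params_pct:.0f}%")
print(f"extra training time: {extra_time_pct:.0f}%")
```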

Dataset

  • Source: Namsel OCR project training data
  • Total samples: 47,777 (after deduplication)
  • Classes: 1,020 Tibetan character classes
  • Image format: 32×32 binary (black/white) glyphs, 1 channel
  • Split: 85% train (40,611) / 15% validation (7,166), stratified
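The card states that the split is stratified but not how it was produced. As a minimal sketch of the idea, the standard-library function below splits each class separately so that both subsets keep roughly the same class proportions. The function name, the rounding rule, and the toy labels are assumptions for illustration; the original training script may well have used a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=0):
    """Return (train_indices, val_indices), splitting each class at about val_fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Very small classes can round to zero validation samples and stay in training.
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return train, val


# Toy labels standing in for the real 1,020 classes.
labels = ["ka"] * 40 + ["kha"] * 20 + ["ga"] * 3
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

With many rare classes, per-class rounding matters: a class with only a handful of samples may contribute nothing to validation, so per-class accuracy on such classes cannot be measured from the validation set alone.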

Data sources within the dataset (the counts below sum to 53,913, so they appear to be pre-deduplication figures):

  • font-draw-samples: 16,320 synthetically rendered samples
  • pkl files: 19,613 additional character samples
  • ui_samples: 9,229 UI-extracted samples
  • normalized_3216_to_3232: 7,117 resized/normalized characters
  • tibcharsamples: 1,619 manually collected images
  • symbols: 15 punctuation/special symbols

Training Configuration

  • Optimizer: AdamW (lr=1e-3, weight_decay=1e-4)
  • LR Schedule: 5-epoch linear warmup → cosine annealing to 1e-6
  • Loss: CrossEntropyLoss with class weighting + label smoothing (0.1)
  • Augmentation: Mixup (alpha=0.2)
  • Mixed precision: AMP/BF16 on NVIDIA A100 (40 GB)
  • Batch size: 512
  • Epochs: 100 (early stopping patience=15, did not trigger)
  • Best epoch: 90
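The learning-rate schedule listed above can be sketched as a function of the epoch index. This is an illustration under stated assumptions, not the training script itself: it assumes per-epoch updates, 0-indexed epochs, and a warmup that reaches the base rate on the fifth epoch. The function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch, base_lr=1e-3, min_lr=1e-6, warmup_epochs=5, total_epochs=100):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Assumption: ramp linearly and hit base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # progress runs from 0 (first post-warmup epoch) to 1 (final epoch).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 50, 99):
    print(epoch, f"{lr_at_epoch(epoch):.2e}")
```

In PyTorch the same shape is commonly built by chaining a linear warmup scheduler with a cosine annealing scheduler, although the card does not say which approach was used.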

Usage

PyTorch

import json

import numpy as np
import torch
from model import TibetanCNN

# Load the class index <-> label mapping
with open("label_mapping.json") as f:
    mapping = json.load(f)

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model = TibetanCNN(num_classes=mapping["num_classes"], dropout=0.3)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # disable dropout and use BatchNorm running statistics

# image: numpy array of shape (32, 32), binary pixels (0 or 1), supplied by the caller
img_tensor = torch.tensor(image, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1,1,32,32]
with torch.no_grad():
    logits = model(img_tensor)
pred_idx = logits.argmax(dim=1).item()
pred_label = mapping["idx_to_label"][str(pred_idx)]  # JSON keys are strings

ONNX (CPU, fastest with INT8)

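import numpy as np
import onnxruntime as ort

# Run the INT8 model on the CPU execution provider
session = ort.InferenceSession("best_model_int8.onnx", providers=["CPUExecutionProvider"])

# image: numpy array (32, 32), float32, values 0.0 or 1.0, supplied by the caller
img = image.astype(np.float32).reshape(1, 1, 32, 32)  # [1,1,32,32]
logits = session.run(None, {"image": img})[0]  # "image" is the model's input name
pred_idx = int(logits.argmax(axis=1)[0])  # plain int, ready for str(pred_idx) lookup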
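The ONNX example stops at a class index. One way to turn a row of logits into labelled, ranked predictions is sketched below using only the standard library. The helper `top_k_labels` and the toy mapping are hypothetical; the only thing taken from this card is that `idx_to_label` is keyed by the index as a string, as in the PyTorch example.

```python
import math


def top_k_labels(logits, idx_to_label, k=3):
    """Return the k most likely (label, probability) pairs for one row of logits."""
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(idx_to_label[str(i)], probs[i]) for i in ranked]


# Toy stand-ins for real model output and label_mapping.json contents.
toy_mapping = {"0": "ka", "1": "kha", "2": "ga", "3": "nga"}
toy_logits = [0.5, 2.0, -1.0, 1.0]

top2 = top_k_labels(toy_logits, toy_mapping, k=2)
for label, prob in top2:
    print(label, f"{prob:.3f}")
```

With real output, the first argument would be something like `logits[0].tolist()` and the second `mapping["idx_to_label"]` loaded from label_mapping.json.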

Using predict.py

from predict import TibetanCNNPredictor

predictor = TibetanCNNPredictor("best_model.pth", "label_mapping.json")
top3 = predictor.predict_top_k(image_array, k=3)
# returns list of (label_idx, confidence) tuples

Background

Tibetan script OCR is an underserved area in document digitization. The original Namsel OCR system (Rowinski, 2016) used hand-crafted features (Zernike moments, Sobel gradients, pixel transition counts) with a scikit-learn classifier and struggled with font variation and low-quality images.

This model replaces that pipeline with learned CNN features, achieving significantly higher accuracy and better generalization across font styles.
