Namsel BUDA CNN — Tibetan Character Classifier

A lightweight CNN for Tibetan character recognition, used in TradutorBUDA. It replaces the hand-crafted feature pipeline of the Namsel OCR system with learned CNN features.

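The model reaches 94.47% validation accuracy on 1,020 Tibetan character classes. It was trained on data derived from the Namsel OCR project dataset: 47,777 samples in total, of which 40,611 are used for training and 7,166 for validation. It is exported to ONNX (FP32 and INT8) for CPU deployment.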

Model Files

File	Size	Description
`best_model.pth`	4.30 MB	PyTorch checkpoint (recommended)
`best_model.onnx`	1.35 MB	ONNX FP32 export
`best_model_int8.onnx`	0.37 MB	ONNX INT8 quantized (fastest on CPU)
`label_mapping.json`	—	Class index ↔ label mapping (1,020 classes)

Architecture — TibetanCNN

CPU-optimized CNN with depthwise separable convolutions and residual connections.

Input (1 × 32 × 32 grayscale)
→ Stem:  Conv2d(1→32, 3×3) + BN + ReLU          [32 × 32 × 32]
→ Down1: DepthwiseSeparableConv(32→64, stride=2)  [64 × 16 × 16]
→ Res1:  ResidualBlock(64)                        [64 × 16 × 16]
→ Down2: DepthwiseSeparableConv(64→128, stride=2) [128 × 8 × 8]
→ Res2:  ResidualBlock(128)                       [128 × 8 × 8]
→ Down3: DepthwiseSeparableConv(128→256,stride=2) [256 × 4 × 4]
→ GlobalAveragePool → Dropout(0.3) → FC(256→1020)

Parameters: 353,596 (the depthwise separable layers use ~8–18× fewer weights than standard convolutions)
Inference: < 5 ms per character on CPU
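The saving from depthwise separable convolutions can be checked with a little arithmetic. The sketch below counts weights for the three downsampling layers in the listing above. It assumes 3×3 kernels and ignores biases and BatchNorm parameters; neither detail is stated in this card, and the helper names are illustrative only.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Channel widths of Down1, Down2 and Down3 from the architecture listing.
DOWN_LAYERS = [(32, 64), (64, 128), (128, 256)]

for c_in, c_out in DOWN_LAYERS:
    std = standard_conv_params(c_in, c_out)
    sep = depthwise_separable_params(c_in, c_out)
    print(f"{c_in:>3} -> {c_out:<3}  standard={std:>7,}  "
          f"separable={sep:>6,}  ratio={std / sep:.1f}x")
```

Under these assumptions each downsampling layer comes out roughly 8 to 9 times smaller than its standard-convolution counterpart, which matches the lower end of the quoted range.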

Results

Model	Val Accuracy	Parameters	Training Time
TibetanCNN (this model)	94.47%	353,596	10.1 min
CNN+Transformer (comparison)	95.52%	491,708	18.2 min

The CNN+Transformer hybrid is 1.05 percentage points more accurate, but it uses 39% more parameters and takes 80% longer to train. For CPU OCR deployment, the lightweight CNN with ONNX/INT8 is the practical choice.
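The relative figures in the paragraph above follow directly from the results table. A short check, using only the numbers reported there (the function and variable names are illustrative):

```python
def relative_increase(baseline: float, other: float) -> float:
    """Return how much larger `other` is than `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


# Values copied from the results table.
cnn = {"acc": 94.47, "params": 353_596, "train_min": 10.1}
hybrid = {"acc": 95.52, "params": 491_708, "train_min": 18.2}

acc_gap = hybrid["acc"] - cnn["acc"]  # difference in percentage points
param_increase = relative_increase(cnn["params"], hybrid["params"])
time_increase = relative_increase(cnn["train_min"], hybrid["train_min"])

print(f"accuracy gap:  {acc_gap:.2f} points")
print(f"parameters:    +{param_increase:.0f}%")
print(f"training time: +{time_increase:.0f}%")
```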

Dataset

Source: Namsel OCR project training data
Total samples: 47,777 (after deduplication)
Classes: 1,020 Tibetan character classes
Image format: 32×32 binary (black/white) glyphs, 1 channel
Split: 85% train (40,611) / 15% validation (7,166), stratified
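This card does not say how the stratified split was produced. As a minimal sketch of the idea, the standard-library function below splits sample indices class by class so that each class keeps roughly the same 85/15 ratio. The function name, the seed and the rounding rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.15, seed=0):
    """Split sample indices into train/val, preserving per-class proportions."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Classes too small to spare a validation sample stay in training.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy example: two common classes and one class with a single sample.
toy_labels = ["ka"] * 20 + ["kha"] * 40 + ["rare"]
train, val = stratified_split(toy_labels)
print(len(train), len(val))
```

With 1,020 classes and fewer than 50 samples per class on average, rare classes need this kind of per-class handling; a plain random split could leave some classes out of one of the two sets entirely.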

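Data sources within the dataset (the counts below sum to 53,913, more than the 47,777 total, so they appear to be counts before deduplication):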

font-draw-samples: 16,320 synthetically rendered samples
pkl files: 19,613 additional character samples
ui_samples: 9,229 UI-extracted samples
normalized_3216_to_3232: 7,117 resized/normalized characters
tibcharsamples: 1,619 manually collected images
symbols: 15 punctuation/special symbols

Training Configuration

Optimizer: AdamW (lr=1e-3, weight_decay=1e-4)
LR Schedule: 5-epoch linear warmup → cosine annealing to 1e-6
Loss: CrossEntropyLoss with class weighting + label smoothing (0.1)
Augmentation: Mixup (alpha=0.2)
Mixed precision: AMP/BF16 on NVIDIA A100 (40 GB)
Batch size: 512
Epochs: 100 (early stopping patience=15, did not trigger)
Best epoch: 90
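The learning-rate schedule listed above can be written as a single function of the epoch index. This is a sketch under stated assumptions, not the training script: it assumes the schedule steps once per epoch, that warmup ramps linearly from `base_lr / 5` up to `base_lr`, and that the cosine phase reaches the minimum exactly at the final epoch. The function name is hypothetical.

```python
import math


def lr_at_epoch(epoch: int, base_lr: float = 1e-3, min_lr: float = 1e-6,
                warmup_epochs: int = 5, total_epochs: int = 100) -> float:
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear warmup: reach base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down to min_lr over the remaining epochs.
    remaining = max(1, total_epochs - warmup_epochs - 1)
    progress = (epoch - warmup_epochs) / remaining
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 50, 90, 99):
    print(f"epoch {epoch:>2}: lr = {lr_at_epoch(epoch):.2e}")
```

In PyTorch the same shape is usually built by chaining a warmup scheduler with `torch.optim.lr_scheduler.CosineAnnealingLR`; whether the original training code did so is not stated here.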

Usage

PyTorch

import json

import torch
from model import TibetanCNN

with open("label_mapping.json") as f:
    mapping = json.load(f)

# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model = TibetanCNN(num_classes=mapping["num_classes"], dropout=0.3)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# image: numpy array of shape (32, 32), binary pixels (0 or 1)
img_tensor = torch.tensor(image, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1,1,32,32]
with torch.no_grad():
    logits = model(img_tensor)
pred_idx = logits.argmax(dim=1).item()
pred_label = mapping["idx_to_label"][str(pred_idx)]

ONNX (CPU, fastest with INT8)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("best_model_int8.onnx", providers=["CPUExecutionProvider"])

# image: numpy array (32, 32), float32, values 0.0 or 1.0
img = image.astype(np.float32).reshape(1, 1, 32, 32)
logits = session.run(None, {"image": img})[0]
pred_idx = logits.argmax(axis=1)[0]
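The ONNX session returns raw logits, so turning them into ranked predictions with confidences takes one more step. The sketch below applies a softmax and keeps the top k entries using only the standard library. The function name and the toy logits are hypothetical, and this is not necessarily how `predict.py` computes its confidences.

```python
import math


def softmax_top_k(logits, k=3):
    """Return the k most likely (class_index, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical logits for a 4-class toy problem. With the real model this
# would be `logits[0].tolist()` from the ONNX call above (1,020 values).
example_logits = [0.5, 3.0, 2.0, -1.0]
for idx, prob in softmax_top_k(example_logits, k=2):
    print(idx, f"{prob:.3f}")
```

Each returned index can then be looked up in `label_mapping.json` via `mapping["idx_to_label"][str(idx)]`, as in the PyTorch example.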

Using predict.py

from predict import TibetanCNNPredictor

predictor = TibetanCNNPredictor("best_model.pth", "label_mapping.json")
top3 = predictor.predict_top_k(image_array, k=3)
# returns list of (label_idx, confidence) tuples

Background

Tibetan script OCR is an underserved area in document digitization. The original Namsel OCR system (Rowinski, 2016) combined hand-crafted features (Zernike moments, Sobel gradients, pixel transition counts) with a scikit-learn classifier, and it struggled with font variation and low-quality images.

This model replaces that pipeline with learned CNN features, achieving significantly higher accuracy and better generalization across font styles.

References

Rowinski, T. "Namsel OCR" (2016). https://escholarship.org/uc/item/6d5781k5
Wang et al. "Unsupervised Tibetan Historical Document Recognition" (2024). https://www.mdpi.com/2076-3417/14/5/2142
Loshchilov & Hutter. "Decoupled Weight Decay Regularization" (ICLR 2019). https://arxiv.org/abs/1711.05101
Zhang et al. "mixup: Beyond Empirical Risk Minimization" (ICLR 2018). https://arxiv.org/abs/1710.09412
