Namsel BUDA CNN: Tibetan Character Classifier
Lightweight CNN for Tibetan character recognition, used in TradutorBUDA. It replaces the hand-crafted feature pipeline of the Namsel OCR system with learned CNN features.
94.47% validation accuracy on 1,020 Tibetan character classes. Trained on 47,777 samples derived from the Namsel OCR project dataset. Exported to ONNX (FP32 and INT8) for CPU deployment.
Model Files
| File | Size | Description |
|---|---|---|
| best_model.pth | 4.30 MB | PyTorch checkpoint (recommended) |
| best_model.onnx | 1.35 MB | ONNX FP32 export |
| best_model_int8.onnx | 0.37 MB | ONNX INT8 quantized (fastest on CPU) |
| label_mapping.json | - | Class index → label mapping (1,020 classes) |
Architecture: TibetanCNN
CPU-optimized CNN with depthwise separable convolutions and residual connections.
Input (1 × 32 × 32 grayscale)
→ Stem: Conv2d(1→32, 3×3) + BN + ReLU [32 × 32 × 32]
→ Down1: DepthwiseSeparableConv(32→64, stride=2) [64 × 16 × 16]
→ Res1: ResidualBlock(64) [64 × 16 × 16]
→ Down2: DepthwiseSeparableConv(64→128, stride=2) [128 × 8 × 8]
→ Res2: ResidualBlock(128) [128 × 8 × 8]
→ Down3: DepthwiseSeparableConv(128→256, stride=2) [256 × 4 × 4]
→ GlobalAveragePool → Dropout(0.3) → FC(256→1020)
- Parameters: 353,596 (~8–18× fewer than standard convolutions)
- Inference: < 5 ms per character on CPU
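The parameter count above can be cross-checked by hand. The tally below is a sketch under assumptions that this README does not state: all convolutions are bias-free, each DepthwiseSeparableConv is a 3×3 depthwise convolution followed by a 1×1 pointwise convolution and a single BatchNorm, and each ResidualBlock contains two such layers. The helper names are hypothetical, and the actual model.py may be organized differently.

```python
def bn(channels):
    # BatchNorm2d learnable parameters: one scale and one shift per channel
    return 2 * channels

def ds_conv(c_in, c_out, k=3):
    # Assumed layout: depthwise k x k (no bias) + pointwise 1 x 1 (no bias) + BatchNorm
    return c_in * k * k + c_in * c_out + bn(c_out)

def std_conv(c_in, c_out, k=3):
    # Standard k x k convolution (no bias) + BatchNorm, used only for comparison
    return c_in * c_out * k * k + bn(c_out)

def res_block(channels, conv):
    # Assumed: two convolution layers per residual block
    return 2 * conv(channels, channels)

def backbone(conv):
    # Down1 + Res1, Down2 + Res2, Down3 from the architecture diagram
    return (
        conv(32, 64) + res_block(64, conv)
        + conv(64, 128) + res_block(128, conv)
        + conv(128, 256)
    )

stem = 1 * 32 * 3 * 3 + bn(32)  # Conv2d(1->32, 3x3) + BN
head = 256 * 1020 + 1020        # FC(256->1020) with bias

total = stem + backbone(ds_conv) + head
ratio = backbone(std_conv) / backbone(ds_conv)

print(total)            # 353596
print(round(ratio, 1))  # 8.3
```

Under these assumptions the tally matches the reported 353,596 parameters, and the separable backbone comes out roughly 8× smaller than the same backbone built from standard 3×3 convolutions. Most of the remaining parameters (262,140) sit in the final fully connected layer.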
Results
| Model | Val Accuracy | Parameters | Training Time |
|---|---|---|---|
| TibetanCNN (this model) | 94.47% | 353,596 | 10.1 min |
| CNN+Transformer (comparison) | 95.52% | 491,708 | 18.2 min |
The CNN+Transformer hybrid gains 1.05 percentage points of validation accuracy but uses 39% more parameters and takes 80% longer to train. For CPU OCR deployment, the lightweight CNN exported to ONNX/INT8 is the practical choice.
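The trade-off figures follow directly from the results table. A short check, using only the numbers reported above (the variable names are illustrative):

```python
tibetan_cnn = {"val_acc": 94.47, "params": 353_596, "train_min": 10.1}
hybrid = {"val_acc": 95.52, "params": 491_708, "train_min": 18.2}

# Accuracy difference in percentage points (not a relative percentage)
acc_gain_pts = hybrid["val_acc"] - tibetan_cnn["val_acc"]

# Relative overheads of the hybrid compared with the plain CNN
extra_params = hybrid["params"] / tibetan_cnn["params"] - 1
extra_time = hybrid["train_min"] / tibetan_cnn["train_min"] - 1

print(f"accuracy gain: {acc_gain_pts:.2f} percentage points")
print(f"extra parameters: {extra_params:.0%}")
print(f"extra training time: {extra_time:.0%}")
```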
Dataset
- Source: Namsel OCR project training data
- Total samples: 47,777 (after deduplication)
- Classes: 1,020 Tibetan character classes
- Image format: 32×32 binary (black/white) glyphs, 1 channel
- Split: 85% train (40,611) / 15% validation (7,166), stratified
Data sources within the dataset:
- font-draw-samples: 16,320 synthetically rendered samples
- pkl files: 19,613 additional character data
- ui_samples: 9,229 UI-extracted samples
- normalized_3216_to_3232: 7,117 resized/normalized characters
- tibcharsamples: 1,619 manually collected images
- symbols: 15 punctuation/special symbols
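A stratified 85/15 split like the one listed above can be sketched in plain Python. This is an illustration, not the project's actual split code: the function name, the per-class rounding rule, and the handling of single-sample classes are assumptions, so the resulting counts will not necessarily reproduce 40,611 / 7,166.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.15, seed=0):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: round per class; a class with one sample stays in training
        n_val = round(len(indices) * val_fraction) if len(indices) > 1 else 0
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)

# Toy labels standing in for the 1,020 character classes
labels = ["ka"] * 20 + ["kha"] * 20 + ["ga"] * 1
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting per class matters here because many of the 1,020 classes have few samples; a purely random split could leave some classes missing from validation or training.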
Training Configuration
- Optimizer: AdamW (lr=1e-3, weight_decay=1e-4)
- LR Schedule: 5-epoch linear warmup → cosine annealing to 1e-6
- Loss: CrossEntropyLoss with class weighting + label smoothing (0.1)
- Augmentation: Mixup (alpha=0.2)
- Mixed precision: AMP/BF16 on NVIDIA A100 (40 GB)
- Batch size: 512
- Epochs: 100 (early stopping patience=15, did not trigger)
- Best epoch: 90
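The learning-rate schedule listed above can be written as a small function of the epoch index. This is a sketch under assumptions: the rate is updated once per epoch, warmup ramps linearly up to the base rate at the fifth epoch, and the cosine phase reaches 1e-6 at the final epoch. The function name is hypothetical, and the training script may step per batch instead.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5,
                base_lr=1e-3, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp: reaches base_lr on the last warmup epoch
        return base_lr * (epoch + 1) / warmup_epochs
    # Progress through the cosine phase, from 0.0 to 1.0
    span = max(1, total_epochs - warmup_epochs - 1)
    progress = (epoch - warmup_epochs) / span
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for epoch in (0, 4, 5, 52, 99):
    print(epoch, f"{lr_at_epoch(epoch):.2e}")
```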
Usage
PyTorch
import json

import numpy as np
import torch

from model import TibetanCNN

# Load the class index -> label mapping
with open("label_mapping.json") as f:
    mapping = json.load(f)

# Load the checkpoint on CPU and restore the weights
checkpoint = torch.load("best_model.pth", map_location="cpu", weights_only=False)
model = TibetanCNN(num_classes=mapping["num_classes"], dropout=0.3)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # disables dropout and uses BatchNorm running statistics

# image: numpy array of shape (32, 32), binary pixels (0 or 1)
img_tensor = torch.tensor(image, dtype=torch.float32).unsqueeze(0).unsqueeze(0)  # [1,1,32,32]

with torch.no_grad():
    logits = model(img_tensor)

pred_idx = logits.argmax(dim=1).item()
pred_label = mapping["idx_to_label"][str(pred_idx)]  # JSON keys are strings
ONNX (CPU, fastest with INT8)
import onnxruntime as ort
import numpy as np
session = ort.InferenceSession("best_model_int8.onnx", providers=["CPUExecutionProvider"])
# image: numpy array (32, 32), float32, values 0.0 or 1.0
img = image.astype(np.float32).reshape(1, 1, 32, 32)
logits = session.run(None, {"image": img})[0]
pred_idx = logits.argmax(axis=1)[0]
Using predict.py
from predict import TibetanCNNPredictor
predictor = TibetanCNNPredictor("best_model.pth", "label_mapping.json")
top3 = predictor.predict_top_k(image_array, k=3)
# returns list of (label_idx, confidence) tuples
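The internals of predict.py are not shown here. One plausible way to turn raw logits into top-k (index, label, confidence) results is a softmax followed by a sort, sketched below in plain Python. The function name, the use of softmax probabilities as "confidence", and the toy labels are assumptions for illustration only.

```python
import math

def top_k_from_logits(logits, idx_to_label, k=3):
    """Return the k most likely (index, label, probability) triples."""
    # Numerically stable softmax: subtract the max before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices sorted by descending probability
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    # JSON-loaded mappings have string keys, hence str(i)
    return [(i, idx_to_label[str(i)], probs[i]) for i in order]

# Toy example with four classes instead of 1,020
toy_mapping = {"0": "ཀ", "1": "ཁ", "2": "ག", "3": "ང"}
for idx, label, prob in top_k_from_logits([2.0, 0.5, 3.0, -1.0], toy_mapping):
    print(idx, label, f"{prob:.3f}")
```

Because training used label smoothing and Mixup, softmax probabilities from this model are likely to be less peaked than those of a model trained with plain cross-entropy, so any confidence threshold is best tuned on validation data.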
Background
Tibetan script OCR is an underserved area in document digitization. The original Namsel OCR system (Rowinski, 2016) used hand-crafted features (Zernike moments, Sobel gradients, pixel transition counts) with a scikit-learn classifier, and it struggled with font variation and low-quality images.
This model replaces that pipeline with learned CNN features, achieving significantly higher accuracy and better generalization across font styles.
References
- Rowinski, T. "Namsel OCR" (2016). https://escholarship.org/uc/item/6d5781k5
- Wang et al. "Unsupervised Tibetan Historical Document Recognition" (2024). https://www.mdpi.com/2076-3417/14/5/2142
- Loshchilov & Hutter. "Decoupled Weight Decay Regularization" (ICLR 2019). https://arxiv.org/abs/1711.05101
- Zhang et al. "mixup: Beyond Empirical Risk Minimization" (ICLR 2018). https://arxiv.org/abs/1710.09412