Shikiji (識字)

Shikiji is an experimental single-character classifier for kuzushiji, hentaigana, cursive Chinese/Japanese character forms, and related handwritten or calligraphic glyphs.

Current release: v0.0.2

Hugging Face model repository: kwadraten/shikiji

Files

The current release keeps active artifacts at the repository root:

supervised_pretrain_checkpoint.pt: PyTorch checkpoint with model weights and training report.
supervised_pretrain_checkpoint.onnx: classifier ONNX model.
supervised_pretrain_checkpoint.metadata.json: classifier labels and preprocessing metadata.
supervised_pretrain_checkpoint.embedding.onnx: embedding ONNX model that outputs the pooled pre-classifier feature vector.
supervised_pretrain_checkpoint.embedding.metadata.json: embedding metadata.
supervised_pretrain_report.json: training report.
cache_manifest.json: source/cache manifest for the training run.

Versioning Policy

Git tags in this project define model release versions. The initial public release is v0.0.1.

Before pushing a new model version, move the previous root-level model artifacts into:

old-versions/<previous-version>/

For example, when publishing v0.0.2, move the v0.0.1 root artifacts to:

old-versions/v0.0.1/

Then upload the new active model artifacts to the repository root and create the matching git tag in this project.
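
The archiving step above can be sketched as a small helper. This is a minimal sketch that assumes a local clone of the model repository; the function name `archive_previous_release` is hypothetical, the file list is copied from the Files section, and the upload and tagging steps are not shown.

```python
from pathlib import Path
import shutil

# Root-level artifact names, copied from the Files section of this release.
ROOT_ARTIFACTS = [
    "supervised_pretrain_checkpoint.pt",
    "supervised_pretrain_checkpoint.onnx",
    "supervised_pretrain_checkpoint.metadata.json",
    "supervised_pretrain_checkpoint.embedding.onnx",
    "supervised_pretrain_checkpoint.embedding.metadata.json",
    "supervised_pretrain_report.json",
    "cache_manifest.json",
]


def archive_previous_release(repo_root, previous_version):
    """Move root-level artifacts into old-versions/<previous_version>/.

    Hypothetical helper: returns the destination paths of the files that
    were moved. Artifacts missing from the root are skipped.
    """
    root = Path(repo_root)
    target = root / "old-versions" / previous_version
    # Refuse to overwrite a release that was already archived.
    if target.exists() and any(target.iterdir()):
        raise FileExistsError(f"{target} already holds an archived release")
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in ROOT_ARTIFACTS:
        src = root / name
        if src.is_file():
            dst = target / name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved
```

For example, `archive_previous_release(".", "v0.0.1")` would be run in the clone before copying the v0.0.2 artifacts to the root.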

Model

Architecture: convnext_tiny.fb_in22k_ft_in1k
Input size: 160x160
Input tensor: NCHW RGB float32 in [0, 1]
Classes: 10,596
Classifier output: logits, shape [batch, 10596]
Embedding output: embedding, shape [batch, 768]
ONNX opset: 17
Dynamic batch: true
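
To illustrate the input layout listed above, here is a minimal NumPy sketch that converts glyph crops to an NCHW float32 tensor in [0, 1]. It assumes the crops are already RGB and already resized to 160x160; resizing, padding, and any further normalization are not specified in this section and should follow the preprocessing entry in the metadata JSON. The function name `to_input_tensor` is hypothetical.

```python
import numpy as np

INPUT_SIZE = 160  # from the Model section


def to_input_tensor(images_hwc_uint8):
    """Convert RGB uint8 images (H, W, 3) or (N, H, W, 3) to NCHW float32 in [0, 1].

    Assumption: images are already resized to 160x160. No mean/std
    normalization is applied here.
    """
    batch = np.asarray(images_hwc_uint8)
    if batch.ndim == 3:
        batch = batch[None, ...]  # add the batch axis for a single image
    if batch.dtype != np.uint8:
        raise ValueError("expected uint8 pixel values in 0..255")
    if batch.shape[1:] != (INPUT_SIZE, INPUT_SIZE, 3):
        raise ValueError(f"expected (N, 160, 160, 3), got {batch.shape}")

    scaled = batch.astype(np.float32) / 255.0  # 0..255 -> [0, 1]
    # NHWC -> NCHW, made contiguous for inference runtimes.
    return np.ascontiguousarray(scaled.transpose(0, 3, 1, 2))
```

Because the batch dimension is dynamic, any number of crops can be stacked along the first axis.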

Training Snapshot

This release continues training from the v0.0.1 all-source rebuilt checkpoint with replay-weighted variant-kana supervision.

Cache: /workspace/classifier_train_cache/all_sources_full_rebuilt
Training samples seen after replay: 2,732,159
Validation samples evaluated: 120,915
Validation top-1: 0.91505
Validation top-5: 0.97573
Validation top-10: 0.98415
Augmentation profile: medium
NINJAL repeat: 2
Variant-kana repeat: 8

Variant Kana Metrics

| group | samples | top-1 | top-5 | top-10 |
| --- | --- | --- | --- | --- |
| `variant_kana` | 2,360 | 0.88136 | 0.99407 | 0.99788 |
| `kwadraten/ninjal-hentaigana` | 8,272 | 0.93061 | 0.99263 | 0.99529 |

These metrics are from the internal validation split of the training cache. They are not a replacement for cross-model benchmark evaluation under a fixed shared test protocol.
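
For readers comparing numbers, the sketch below shows the conventional definition of top-k accuracy: a sample counts as a hit when its true class is among the k highest logits. The project's own evaluation code is not shown here, so this is an assumption about the definition, and `top_k_accuracy` is a hypothetical name.

```python
import numpy as np


def top_k_accuracy(logits, targets, ks=(1, 5, 10)):
    """Return {k: fraction of rows whose target is among the k largest logits}.

    logits: array of shape [num_samples, num_classes]
    targets: integer class indices of shape [num_samples]
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    max_k = min(max(ks), logits.shape[1])

    # Class indices per row, sorted by descending logit, truncated to max_k.
    ranked = np.argsort(-logits, axis=1)[:, :max_k]
    hits = ranked == targets[:, None]

    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

Running the same function on a subset of rows, for example only samples from one label group, gives per-group figures in the style of the table above.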

Data Notes

The training pool combines Japanese kuzushiji sources and Chinese calligraphy auxiliary data. Label normalization repairs known mojibake in Chinese calligraphy zip paths and maps single-character labels to U+XXXX ids.
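
As an illustration of the U+XXXX label ids, here is a minimal sketch of the mapping in both directions. The exact formatting (uppercase hex, zero-padded to at least four digits) is an assumption based on the usual Unicode notation, the function names are hypothetical, and the mojibake repair step is not shown.

```python
def char_to_label_id(label):
    """Map a single-character label to a U+XXXX id, e.g. 'あ' -> 'U+3042'."""
    if len(label) != 1:
        raise ValueError(f"expected exactly one character, got {label!r}")
    # Uppercase hex, padded to at least four digits (assumed convention).
    return f"U+{ord(label):04X}"


def label_id_to_char(label_id):
    """Inverse mapping: 'U+3042' -> 'あ'."""
    if not label_id.startswith("U+"):
        raise ValueError(f"not a U+XXXX id: {label_id!r}")
    return chr(int(label_id[2:], 16))
```

Code points outside the Basic Multilingual Plane, where the Unicode hentaigana characters live, simply produce five hex digits, such as `U+1B002`.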

Hentaigana and normalized kana require separate evaluation. Some sources use modern hiragana labels for glyphs that may visually resemble variant kana, so downstream evaluation should distinguish variant-level labels from normalized-kana labels.

Deployment Notes

Use the metadata JSON beside each ONNX file to recover labels, decoded characters, Unicode names, preprocessing, model name, and output names.

The ONNX classifier is appropriate for top-k candidate generation. The embedding ONNX is intended for similarity search, retrieval, clustering, and nearest-neighbor inspection.
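
The two intended uses can be sketched with plain NumPy once the model outputs are available. This sketch assumes `logits` is one row of the classifier output, `labels` is the class list recovered from the metadata JSON (its exact schema is not shown here), and the embedding vectors come from the embedding model; both function names are hypothetical. Softmax is applied only to make scores readable, since the ranking depends on the logits alone.

```python
import numpy as np


def top_k_candidates(logits, labels, k=5):
    """Return [(label, probability), ...] for the k best classes of one sample."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    k = min(k, probs.shape[0])
    best = np.argsort(-probs)[:k]
    return [(labels[i], float(probs[i])) for i in best]


def nearest_neighbors(query, gallery, k=5):
    """Return (indices, cosine similarities) of the k gallery rows closest to query.

    query: one embedding vector, shape [dim]
    gallery: reference embeddings, shape [num_items, dim]
    """
    q = np.asarray(query, dtype=np.float64)
    g = np.asarray(gallery, dtype=np.float64)
    # L2-normalize so the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    sims = g @ q
    k = min(k, sims.shape[0])
    best = np.argsort(-sims)[:k]
    return best, sims[best]
```

Cosine similarity is one common choice for comparing pooled features; the source does not prescribe a distance metric, so other metrics may suit a given retrieval task better.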

Changelog

v0.0.2

Strengthened hentaigana / variant-kana support with replay-weighted kwadraten/ninjal-hentaigana and variant-kana samples.
Added validation reporting by label group, including dedicated variant_kana top-k metrics.
Exported refreshed classifier and embedding ONNX artifacts from the replay-trained checkpoint.

v0.0.1

Initial public release of the all-source rebuilt ConvNeXt-Tiny classifier.
