Shikiji (識字)

Shikiji is an experimental single-character classifier for kuzushiji, hentaigana, cursive Chinese/Japanese character forms, and related handwritten or calligraphic glyphs.

Current release: v0.0.1

Hugging Face model repository: kwadraten/shikiji

Files

The current release keeps active artifacts at the repository root:

  • supervised_pretrain_checkpoint.pt: PyTorch checkpoint with model weights and training report.
  • supervised_pretrain_checkpoint.onnx: classifier ONNX model.
  • supervised_pretrain_checkpoint.metadata.json: classifier labels and preprocessing metadata.
  • supervised_pretrain_checkpoint.embedding.onnx: embedding ONNX model that outputs the pooled pre-classifier feature vector.
  • supervised_pretrain_checkpoint.embedding.metadata.json: embedding metadata.
  • supervised_pretrain_report.json: training report.
  • cache_manifest.json: source/cache manifest for the training run.
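
The ONNX files and their metadata can be fetched one at a time with the huggingface_hub client. The following is a minimal sketch, assuming the artifacts stay at the repository root under the names listed above. The helper name `download_release` and the subset of files it fetches are illustrative choices, not part of the release.

```python
# Repository id and file names are copied from this model card.
REPO_ID = "kwadraten/shikiji"
ONNX_RELEASE_FILES = [
    "supervised_pretrain_checkpoint.onnx",
    "supervised_pretrain_checkpoint.metadata.json",
    "supervised_pretrain_checkpoint.embedding.onnx",
    "supervised_pretrain_checkpoint.embedding.metadata.json",
]


def download_release(filenames=ONNX_RELEASE_FILES, revision=None):
    """Download root-level files and return {filename: local cache path}.

    `revision` is optional. Passing a tag name assumes the model repository
    itself carries that tag, which this card does not confirm.
    """
    # Imported lazily so the constants above stay usable without the package.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=REPO_ID, filename=name, revision=revision)
        for name in filenames
    }
```

Calling `download_release()` needs network access and returns paths inside the local Hugging Face cache.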

Versioning Policy

Git tags in this project define model release versions. The initial public release is v0.0.1.

Before pushing a new model version, move the previous root-level model artifacts into:

old-versions/<previous-version>/

For example, when publishing v0.0.2, move the v0.0.1 root artifacts to:

old-versions/v0.0.1/

Then upload the new active model artifacts to the repository root and create the matching git tag in this project.
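
The archive step can be planned before any file is touched. The sketch below only computes source and destination paths. It assumes that all seven root-level files listed under Files count as model artifacts, and `plan_archive_moves` is a hypothetical helper name.

```python
from pathlib import PurePosixPath

# Root-level artifact names, copied from the Files section of this card.
# Treating all seven as "model artifacts" is an assumption.
ROOT_ARTIFACTS = [
    "supervised_pretrain_checkpoint.pt",
    "supervised_pretrain_checkpoint.onnx",
    "supervised_pretrain_checkpoint.metadata.json",
    "supervised_pretrain_checkpoint.embedding.onnx",
    "supervised_pretrain_checkpoint.embedding.metadata.json",
    "supervised_pretrain_report.json",
    "cache_manifest.json",
]


def plan_archive_moves(previous_version, artifacts=ROOT_ARTIFACTS):
    """Return (source, destination) pairs for archiving the previous release."""
    archive_dir = PurePosixPath("old-versions") / previous_version
    return [(name, str(archive_dir / name)) for name in artifacts]


# Example: the moves needed before publishing v0.0.2.
for source, destination in plan_archive_moves("v0.0.1"):
    print(f"{source} -> {destination}")
```

Keeping the plan separate from the upload makes it easy to review the moves before applying them with whatever tool manages the repository.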

Model

  • Architecture: convnext_tiny.fb_in22k_ft_in1k
  • Input size: 160x160
  • Input tensor: NCHW RGB float32 in [0, 1]
  • Classes: 10,596
  • Classifier output: logits, shape [batch, 10596]
  • Embedding output: embedding, shape [batch, 768]
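
The input contract above can be sketched with Pillow and NumPy. This is an illustration under stated assumptions: the plain bilinear resize ignores aspect ratio and is not confirmed as the training-time preprocessing, and any extra normalization described in the metadata JSON would need to be applied as well. The function name `to_input_tensor` is hypothetical.

```python
import numpy as np
from PIL import Image

INPUT_SIZE = 160  # from the Model section: 160x160 input


def to_input_tensor(image):
    """Convert a PIL image to a [1, 3, 160, 160] float32 array in [0, 1]."""
    rgb = image.convert("RGB")
    # Assumption: a direct resize; the metadata JSON is the authority here.
    resized = rgb.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
    hwc = np.asarray(resized, dtype=np.float32) / 255.0  # HWC, scaled to [0, 1]
    chw = np.transpose(hwc, (2, 0, 1))                   # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis, ...])    # add the batch axis
```

Several glyph crops can be stacked into one batch with `np.concatenate([...], axis=0)`.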

Training Snapshot

This release was trained on the cache rebuilt from all data sources.

  • Training samples seen: 2,306,957
  • Validation samples evaluated: 50,000
  • Validation top-1: 0.90372
  • Validation top-5: 0.96970
  • Validation top-10: 0.97916

These metrics were computed on the internal validation split of the training cache. They do not replace a cross-model benchmark evaluation under a fixed, shared test protocol and should not be compared directly with results reported for other models.
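
For reference, top-k accuracy counts a sample as correct when its true class is among the k highest-scoring classes. The sketch below shows that definition with NumPy. It is not the project's evaluation code, and the function name is illustrative.

```python
import numpy as np


def top_k_accuracy(logits, targets, k):
    """Fraction of rows whose target index is among the k largest logits."""
    logits = np.asarray(logits)
    targets = np.asarray(targets)
    # Indices of the k largest logits per row; order within the k is irrelevant.
    top_k = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    hits = (top_k == targets[:, None]).any(axis=1)
    return float(hits.mean())
```

With 50,000 validation samples, the reported rates correspond to 45,186 (top-1), 48,485 (top-5), and 48,958 (top-10) correctly ranked samples.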

Data Notes

The training pool combines Japanese kuzushiji sources with auxiliary Chinese calligraphy data. Label normalization repairs known mojibake in the file paths inside the Chinese calligraphy zip archives and maps each single-character label to a Unicode code point id of the form U+XXXX.
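
The mapping between a single character and a U+XXXX id can be sketched as follows. The exact formatting (uppercase hex, zero-padded to at least four digits) is an assumption based on the usual Unicode notation, and both helper names are hypothetical.

```python
def char_to_label_id(char):
    """Map one character to an id such as 'U+3042'."""
    if len(char) != 1:
        raise ValueError("expected exactly one Unicode code point")
    # Assumption: uppercase hex, padded to at least four digits.
    return f"U+{ord(char):04X}"


def label_id_to_char(label_id):
    """Map an id such as 'U+3042' back to its character."""
    if not label_id.startswith("U+"):
        raise ValueError(f"not a U+XXXX id: {label_id!r}")
    return chr(int(label_id[2:], 16))
```

Characters outside the Basic Multilingual Plane, which includes the Unicode hentaigana block, produce five hex digits under this scheme.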

Hentaigana and normalized kana require separate evaluation. Some sources use modern hiragana labels for glyphs that may visually resemble variant kana, so downstream evaluation should distinguish variant-level labels from normalized-kana labels.
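
One way to keep the two evaluations apart is to group results by whether the true label is a Unicode hentaigana code point (U+1B002 to U+1B11E). The sketch below assumes that variant-level labels use those code points and that labels are U+XXXX ids; neither is confirmed for every source. The helper names are hypothetical.

```python
# Unicode hentaigana range: Kana Supplement plus Kana Extended-A.
HENTAIGANA_FIRST = 0x1B002
HENTAIGANA_LAST = 0x1B11E


def is_hentaigana_id(label_id):
    """True if a 'U+XXXX' id falls in the Unicode hentaigana range."""
    code_point = int(label_id[2:], 16)
    return HENTAIGANA_FIRST <= code_point <= HENTAIGANA_LAST


def split_accuracy(records):
    """Accuracy per group from (true_id, predicted_id) pairs.

    A group with no samples maps to None.
    """
    totals = {"hentaigana": [0, 0], "other": [0, 0]}  # [correct, seen]
    for true_id, predicted_id in records:
        group = "hentaigana" if is_hentaigana_id(true_id) else "other"
        totals[group][0] += int(true_id == predicted_id)
        totals[group][1] += 1
    return {
        group: (correct / seen if seen else None)
        for group, (correct, seen) in totals.items()
    }
```

Glyphs that a source labels with modern hiragana land in the "other" group here even when they look like variant kana, which is why the two label levels need to be reported separately.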

Deployment Notes

Use the metadata JSON beside each ONNX file to recover labels, decoded characters, Unicode names, preprocessing, model name, and output names.
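
A minimal loading sketch with onnxruntime follows. The metadata key names are not documented in this card, so the code only lists the top-level keys instead of assuming them, and it reads input and output names from the ONNX session itself. The helper names are hypothetical.

```python
import json


def load_metadata(path):
    """Read a metadata JSON file into a Python object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def metadata_overview(metadata):
    """Map each top-level metadata key to the type name of its value."""
    return {key: type(value).__name__ for key, value in metadata.items()}


def run_onnx(model_path, batch):
    """Run one forward pass and return {output name: array}.

    `batch` is expected to be an NCHW float32 array as described under Model.
    """
    import onnxruntime as ort  # imported lazily; only needed for inference

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    output_names = [output.name for output in session.get_outputs()]
    values = session.run(output_names, {input_name: batch})
    return dict(zip(output_names, values))
```

Printing `metadata_overview(load_metadata(...))` for each metadata file is a quick way to find where labels and preprocessing settings are stored before writing code that depends on them.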

The ONNX classifier is appropriate for top-k candidate generation. The embedding ONNX is intended for similarity search, retrieval, clustering, and nearest-neighbor inspection.
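
Both uses can be sketched with NumPy once the model outputs are available. The example assumes `labels` is a list aligned with the logit indices and uses cosine similarity for the embeddings; the card does not prescribe a distance measure, so treat that choice as an assumption. The function names are hypothetical.

```python
import numpy as np


def top_k_candidates(logits, labels, k=5):
    """Return [(label, probability), ...] for one sample, best first."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = np.exp(logits - logits.max())  # numerically stable softmax
    probs = shifted / shifted.sum()
    order = np.argsort(-probs)[:k]
    return [(labels[i], float(probs[i])) for i in order]


def nearest_neighbors(query, gallery, k=5):
    """Return [(gallery index, cosine similarity), ...], most similar first."""
    query = np.asarray(query, dtype=np.float64)
    gallery = np.asarray(gallery, dtype=np.float64)
    unit_query = query / np.linalg.norm(query)
    unit_gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    similarities = unit_gallery @ unit_query
    order = np.argsort(-similarities)[:k]
    return [(int(i), float(similarities[i])) for i in order]
```

For large galleries, the brute-force matrix product could be replaced by an approximate nearest-neighbor index; the ranking logic stays the same.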
