AnandaSky

AnandaSky is a vision-language model for line-level transcription of historical sinographic documents.

The name combines Ananda—the disciple of the Buddha traditionally associated with the "encoding" of early Buddhist texts—and Sky, the opening character of the Thousand Character Classic, a text long used in premodern China to enumerate items.

Paper

This model is described in the following paper:

AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents

Model Overview

AnandaSky is a vision-language model for efficient transcription of historical sinographic line images. It contains approximately 626M parameters and combines:

  • a Vision Transformer (ViT) encoder
  • an autoregressive Qwen3-based decoder

The model was trained on 4 million line images extracted from historical documents produced in China and Korea between the 8th and 20th centuries, including both printed editions and handwritten manuscripts.

A full description of the datasets, preprocessing pipeline, and training procedure is provided in the accompanying paper.

Evaluation Results

The model achieves the following character error rates (CER) on in-domain test sets:

| Dataset | CER |
| --- | --- |
| MTHv2 | 0.92% |
| Sibu Congkan | 0.43% |
| Korean Anthologies | 0.33% |
| Dunhuang Manuscripts | 1.38% |
| Qing Legal Documents | 4.89% |

On held-out benchmarks, the model achieves the following character error rates:

| Dataset | CER |
| --- | --- |
| ICDAR2019-HDRC | 0.96% |
| CUHK Challenge 2021 | 0.82% |
| CUHK Challenge 2022 | 1.61% |
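Character error rate is the character-level edit distance between the model's transcription and the reference, divided by the reference length. A minimal reference implementation in plain Python (the function names are ours, not part of the model's API):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (r != h)))     # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substituted character out of four -> CER of 0.25
print(cer("天地玄黃", "天地玄黄"))  # → 0.25
```

When aggregating over a test set, the reported figure is typically the total edit distance divided by the total reference length, rather than a mean of per-line CERs.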

Intended Use

AnandaSky is intended for line-level transcription of historical sinographic documents. It can process both single-column and double-column vertical text layouts.

Transcription Normalization

If you notice that the model systematically produces an incorrect transcription for a specific character or glyph form, please consider opening an issue in the repository. Such reports are valuable for improving the normalization pipeline and future model releases.

Hardware and Dependencies

This model has a hard dependency on FlashAttention.

Required Environment

  • Python >= 3.10
  • PyTorch >= 2.1
  • NVIDIA Ampere-or-newer GPU
  • transformers
  • flash-attn

Install FlashAttention

```shell
pip install flash-attn --no-build-isolation
```

⚠️ FlashAttention is required. The model will not run without it.
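Before loading the model, it can be worth verifying that the `flash_attn` package is importable and that the GPU meets the Ampere-or-newer requirement. A small sketch (these helper names are ours; Ampere corresponds to CUDA compute capability 8.0 and above):

```python
import importlib.util

def flash_attn_available() -> bool:
    # flash-attn installs as the "flash_attn" package
    return importlib.util.find_spec("flash_attn") is not None

def ampere_or_newer() -> bool:
    # Ampere GPUs report compute capability 8.x or higher; guard the torch
    # import so the check degrades gracefully on machines without CUDA.
    try:
        import torch
        return torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
    except ImportError:
        return False

print("flash-attn installed:", flash_attn_available())
print("Ampere-or-newer GPU:", ampere_or_newer())
```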

Loading the Model

Because this repository uses custom Transformers modeling code, the model must be loaded with trust_remote_code=True.

Example

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```

Minimal Inference Example

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = torch.device("cuda")
DTYPE = torch.bfloat16

processor = AutoProcessor.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype=DTYPE,
).to(DEVICE)

image = Image.open("line_image.png")
inputs = processor(images=image, return_tensors="pt")

# Move all tensors to the GPU; pixel values are also cast to the model dtype.
inputs["pixel_values"] = inputs["pixel_values"].to(dtype=DTYPE)
inputs = {k: v.to(DEVICE, non_blocking=True) for k, v in inputs.items()}

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=DTYPE):
    output = model.generate(**inputs, use_cache=True)

# Decode, skipping the first token of the generated sequence.
text = processor.decode(output[0, 1:], skip_special_tokens=True).strip()
print(text)
```

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

It may be used for research, academic, and other non-commercial purposes only. Commercial use is not permitted without prior permission from the authors.

Citation

```bibtex
@inproceedings{brisson:hal-05548531,
  TITLE = {{AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents}},
  AUTHOR = {Brisson, Colin and Kahfy, Ayoub and Constant, Fr{\'e}d{\'e}ric and Bui, Marc},
  URL = {https://hal.science/hal-05548531},
  NOTE = {BnF DataLab Projet READ\_Chinese},
  BOOKTITLE = {{The Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026)}},
  ADDRESS = {Majorca, Spain},
  YEAR = {2026},
  MONTH = May,
  KEYWORDS = {Dunhuang manuscripts ; long-tailed distribution ; vision-language models ; HTR ; OCR ; Classical Chinese ; Historical documents},
  PDF = {https://hal.science/hal-05548531v1/file/AnandaSky_Technical_Report.pdf},
  HAL_ID = {hal-05548531},
  HAL_VERSION = {v1},
}
```

Contact

For questions, bug reports, or feedback, please open an issue in the repository.
