# AnandaSky
AnandaSky is a vision-language model for line-level transcription of historical sinographic documents.
The name combines Ananda—the disciple of the Buddha traditionally associated with the "encoding" of early Buddhist texts—and Sky, the opening character of the Thousand Character Classic, a text long used in premodern China to enumerate items.
## Paper
This model is described in the following paper:
*AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents*
## Model Overview
AnandaSky is a vision-language model for efficient transcription of historical sinographic line images. It contains approximately 626M parameters and combines:
- a Vision Transformer (ViT) encoder
- an autoregressive Qwen3-based decoder
The model was trained on 4 million line images extracted from historical documents produced in China and Korea between the 8th and 20th centuries, including both printed editions and handwritten manuscripts.
A full description of the datasets, preprocessing pipeline, and training procedure is provided in the accompanying paper.
## Evaluation Results
The model achieves the following character error rates (CER) on in-domain test sets:
| Dataset | CER |
|---|---|
| MTHv2 | 0.92% |
| Sibu Congkan | 0.43% |
| Korean Anthologies | 0.33% |
| Dunhuang Manuscripts | 1.38% |
| Qing Legal Documents | 4.89% |
The model achieves the following character error rates (CER) on held-out benchmarks:
| Dataset | CER |
|---|---|
| ICDAR2019-HDRC | 0.96% |
| CUHK Challenge 2021 | 0.82% |
| CUHK Challenge 2022 | 1.61% |
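Character error rate is the Levenshtein edit distance between the predicted and reference transcriptions, divided by the reference length. As a minimal illustrative sketch (the `cer` helper below is not part of this release), the scores above can be reproduced for your own data like so:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Standard dynamic-programming Levenshtein, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One substituted character in a four-character line -> CER 0.25
print(cer("天地玄黃", "天地玄黄"))  # 0.25
```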
## Intended Use
AnandaSky is intended for line-level transcription of historical sinographic documents. It can process both single-column and double-column vertical text layouts.
## Transcription Normalization
If you notice that the model systematically produces an incorrect transcription for a specific character or glyph form, please consider opening an issue in the repository. Such reports are valuable for improving the normalization pipeline and future model releases.
## Hardware and Dependencies
This model has a hard dependency on FlashAttention.
### Required Environment

- Python >= 3.10
- PyTorch >= 2.1
- NVIDIA Ampere-or-newer GPU
- transformers
- flash-attn
### Install FlashAttention

```bash
pip install flash-attn --no-build-isolation
```

⚠️ FlashAttention is required. The model will not run without it.
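Before loading the model, you can sanity-check the environment. The sketch below is illustrative and not part of the release; it assumes that compute capability 8.0 marks the Ampere generation:

```python
import importlib.util


def flash_attn_available() -> bool:
    """True if the flash_attn package is importable."""
    return importlib.util.find_spec("flash_attn") is not None


def gpu_is_ampere_or_newer() -> bool:
    """True if a CUDA GPU with compute capability >= 8.0 (Ampere) is present."""
    try:
        import torch
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major >= 8
    except ImportError:
        return False


if __name__ == "__main__":
    print("flash-attn installed:", flash_attn_available())
    print("Ampere-or-newer GPU:", gpu_is_ampere_or_newer())
```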
## Loading the Model
Because this repository uses custom Transformers modeling code, the model must be loaded with trust_remote_code=True.
### Example

```python
from transformers import AutoModelForCausalLM

# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```
## Minimal Inference Example

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = torch.device("cuda")
DTYPE = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype=DTYPE,
)
model = model.to(DEVICE)

processor = AutoProcessor.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
)

# Load a single line image and prepare the model inputs.
image = Image.open("line_image.png")
inputs = processor(images=image, return_tensors="pt")

# Move tensors to the GPU; pixel values are also cast to bfloat16.
inputs["input_ids"] = inputs["input_ids"].to(device=DEVICE, non_blocking=True)
inputs["attention_mask"] = inputs["attention_mask"].to(device=DEVICE, non_blocking=True)
inputs["pixel_values"] = inputs["pixel_values"].to(device=DEVICE, dtype=DTYPE, non_blocking=True)
inputs["patch_attention_mask"] = inputs["patch_attention_mask"].to(device=DEVICE, non_blocking=True)

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=DTYPE, enabled=True):
        output = model.generate(
            **inputs,
            use_cache=True,
        )

# Skip the first (BOS) token and strip special tokens from the transcription.
text = processor.decode(output[0, 1:], skip_special_tokens=True).strip()
print(text)
```
## License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
It may be used for research, academic, and other non-commercial purposes only. Commercial use is not permitted without prior permission from the authors.
## Citation

```bibtex
@inproceedings{brisson:hal-05548531,
  TITLE = {{AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents}},
  AUTHOR = {Brisson, Colin and Kahfy, Ayoub and Constant, Fr{\'e}d{\'e}ric and Bui, Marc},
  URL = {https://hal.science/hal-05548531},
  NOTE = {BnF DataLab Projet READ\_Chinese},
  BOOKTITLE = {{The Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026)}},
  ADDRESS = {Majorca, Spain},
  YEAR = {2026},
  MONTH = May,
  KEYWORDS = {Dunhuang manuscripts ; long-tailed distribution ; vision-language models ; HTR ; OCR ; Classical Chinese ; Historical documents},
  PDF = {https://hal.science/hal-05548531v1/file/AnandaSky_Technical_Report.pdf},
  HAL_ID = {hal-05548531},
  HAL_VERSION = {v1},
}
```
## Contact
For questions, bug reports, or feedback, please open an issue in the repository.