tsumugi — model checkpoints

Code: https://github.com/yanagizawa-naoto/tsumugi

Encoder–decoder model for table detection, structure recognition, and OCR. Encoder: DINOv2 ViT-B/14. Decoder: from-scratch autoregressive Transformer with ROI-pool conditioning injected at the <cls> decoder position.
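The ROI-pool conditioning idea can be sketched in a few lines. Everything below (the function name, the mean-pool choice, the grid geometry) is an illustrative assumption, not the actual tsumugi code: encoder patch tokens are pooled over a region of interest into a single vector that could be injected at the decoder's <cls> slot.

```python
import torch

def roi_pool_condition(patch_feats: torch.Tensor, box, grid_hw):
    """Pool encoder tokens over an ROI (illustrative sketch).

    patch_feats: (H*W, D) encoder patch tokens.
    box: (x0, y0, x1, y1) in patch-grid coordinates.
    grid_hw: (H, W) patch-grid shape.
    """
    h, w = grid_hw
    feats = patch_feats.view(h, w, -1)          # back to the 2-D patch grid
    x0, y0, x1, y1 = box
    roi = feats[y0:y1, x0:x1].reshape(-1, feats.shape[-1])
    return roi.mean(dim=0)                      # one conditioning vector

# ViT-B/14 at a 518 px input gives a 37x37 patch grid of 768-dim tokens.
feats = torch.randn(37 * 37, 768)
cond = roi_pool_condition(feats, (5, 5, 15, 12), (37, 37))
```

Mean pooling is just one reasonable choice here; the real model may pool differently or project the vector before injection.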

Key checkpoints:

| ckpt | best for | synth F1 | PubMed F1 | real OCR exact | size |
|------|----------|----------|-----------|----------------|------|
| dino_pub8/ | real OCR | — | 0.985 | 29.0% | 559 MB |
| dino_pub7/ | OCR @ 518 input | 0.992 | 0.983 | 21.1% | 559 MB |
| dino_pub6/ | detection IoU >= 0.9 | 0.993 | 0.989 | 14.1% | 559 MB |
| dino_pub5/ | first OCR-capable (synth only) | 0.996 | 0.979 | — | 558 MB |
| dino_pub4/ | multi-domain (DocLayNet F1 0.64) | 0.995 | 0.987 | — | 556 MB |
| dino_pub3/ | PubMed-best detection | 0.993 | 0.992 | — | 556 MB |
| dino_pub2/ | + Structure cells | 0.990 | 0.991 | — | 556 MB |
| dino_pub1/ | first real-tables detector | 0.991 | 0.979 | — | 556 MB |
| dino_v2/ | DINOv2 unfrozen, synth-only | 0.990 | 0.271 | — | 556 MB |
| dino_v1/ | DINOv2 frozen baseline | 0.987 | — | — | 556 MB |
| roi_v3/ | pre-DINOv2, from-scratch ViT | 0.986 | — | — | 396 MB |

roi_v2/, roi_run1/, run9/, run1/: earlier from-scratch lineage.

Each `<run>/` folder contains:

  • *_step*_model_only.pt β€” slim checkpoint (model weights, args, vocab_size; no optimizer state). Loadable for eval/inference but not for resume-training.
  • *_config.json β€” exact argparse config used for that run.
  • *_eval_*.log β€” per-eval result text (PubTables Detection, Structure cell-level, OCR, synth, DocLayNet, etc.).

Loading

```python
import torch
from huggingface_hub import hf_hub_download
from model.tokenizer import TsumugiTokenizer
from model.model_dino import TsumugiModelDINO  # see the GitHub repo

# Fetch a slim checkpoint from the Hub.
ckpt_path = hf_hub_download(
    "Naoto-ipu/tsumugi-models",
    "dino_pub8/dino_pub8_step12000_model_only.pt",
)
# weights_only=False: the checkpoint stores the args dict alongside the weights.
ck = torch.load(ckpt_path, map_location="cuda", weights_only=False)
a = ck["args"]

# Rebuild the model with the exact config the run was trained with.
tok = TsumugiTokenizer()
m = TsumugiModelDINO(
    vocab_size=tok.vocab_size,
    dec_layers=a["dec_layers"], num_heads=a["dec_heads"],
    max_seq_len=a["max_len"], dropout=0.0, use_roi=True,
    freeze_encoder=False,
    image_size=a.get("encoder_image_size") or 518,
).to("cuda")
m.load_state_dict(ck["model"])
m.eval()
```
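The slim "model_only" layout can be mimicked offline with a toy module to see why these checkpoints load for eval but cannot resume training. The layer and arg keys below are stand-ins, assuming only that the checkpoint is a plain dict with `"model"` and `"args"` entries as the snippet above uses:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a "*_model_only.pt" checkpoint: weights + args,
# deliberately no optimizer state.
net = nn.Linear(4, 2)
ckpt = {
    "model": net.state_dict(),
    "args": {"dec_layers": 4, "dec_heads": 8, "max_len": 512},
    "vocab_size": 100,
}
path = os.path.join(tempfile.mkdtemp(), "toy_step1_model_only.pt")
torch.save(ckpt, path)

# weights_only=False because the file holds a plain args dict,
# not just tensors.
ck = torch.load(path, map_location="cpu", weights_only=False)
fresh = nn.Linear(4, 2)
fresh.load_state_dict(ck["model"])  # enough for eval/inference
# There is no ck["optimizer"], so an exact training resume is impossible.
```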

Lineage

The naming convention reflects the chronological progression:

  • run1 β†’ run9: from-scratch ViT, scaling experiments
  • roi_run1 β†’ roi_v3: added ROI-pool conditioning, max_len, char fixes
  • dino_v1 β†’ dino_v3: switched encoder to DINOv2 ViT-B/14, no-table training
  • dino_pub1 β†’ dino_pub4: added PubTables-1M and DocLayNet real data
  • dino_pub5 β†’ dino_pub8: added char-level OCR, real OCR at 686Γ—686

See full per-run notes and metrics in the corresponding *_eval_*.log files inside each subfolder, and the tsumugi GitHub README.
