# tsumugi model checkpoints

Code: https://github.com/yanagizawa-naoto/tsumugi

Encoder–decoder for table detection, structure recognition, and OCR.
Encoder: DINOv2 ViT-B/14. Decoder: from-scratch autoregressive
Transformer with ROI-pool conditioning at the `<cls>` decoder position.
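The repo defines the exact conditioning mechanism; as a rough sketch (the shapes, variable names, and the mean-pool choice are illustrative assumptions, not the repo's API), ROI-pool conditioning at the first decoder position can look like:

```python
import torch

# Hedged sketch of ROI-pool conditioning; all names and shapes are assumptions.
B, C, H, W = 2, 768, 37, 37            # ViT-B/14 patch grid at 518 px input
feats = torch.randn(B, C, H, W)        # encoder patch features as a 2D grid

# one table ROI per image, (x1, y1, x2, y2) in feature-grid coordinates
rois = [(4, 4, 20, 20), (0, 0, 36, 36)]

# mean-pool each ROI into a single conditioning vector per image
cond = torch.stack([feats[b, :, y1:y2, x1:x2].mean(dim=(1, 2))
                    for b, (x1, y1, x2, y2) in enumerate(rois)])  # (B, C)

# inject at the first (<cls>) decoder position
dec_emb = torch.randn(B, 128, C)       # (batch, seq_len, dim) token embeddings
dec_emb[:, 0, :] = dec_emb[:, 0, :] + cond
print(dec_emb.shape)                   # torch.Size([2, 128, 768])
```

The idea is that the decoder's first token carries a summary of the table region, so the autoregressive output is grounded in one specific ROI rather than the whole page.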
Key checkpoints:

| ckpt | best for | synth F1 | PubMed F1 | real OCR exact | size |
|---|---|---|---|---|---|
| `dino_pub8/` | real OCR | – | 0.985 | 29.0% | 559MB |
| `dino_pub7/` | OCR @ 518 input | 0.992 | 0.983 | 21.1% | 559MB |
| `dino_pub6/` | detection IoU>=0.9 | 0.993 | 0.989 | 14.1% | 559MB |
| `dino_pub5/` | first OCR-capable (synth only) | 0.996 | 0.979 | – | 558MB |
| `dino_pub4/` | multi-domain (DocLayNet F1 0.64) | 0.995 | 0.987 | – | 556MB |
| `dino_pub3/` | PubMed-best detection | 0.993 | 0.992 | – | 556MB |
| `dino_pub2/` | + structure cells | 0.990 | 0.991 | – | 556MB |
| `dino_pub1/` | first real-tables detector | 0.991 | 0.979 | – | 556MB |
| `dino_v2/` | DINOv2 unfrozen, synth-only | 0.990 | 0.271 | – | 556MB |
| `dino_v1/` | DINOv2 frozen baseline | 0.987 | – | – | 556MB |
| `roi_v3/` | pre-DINOv2, from-scratch ViT | 0.986 | – | – | 396MB |

Earlier from-scratch lineage: `roi_v2/`, `roi_run1/`, `run9/`, `run1/`.
Each `<run>/` folder contains:

- `*_step*_model_only.pt` – slim checkpoint (model weights, args, vocab_size; no optimizer state). Loadable for eval/inference, but not for resuming training.
- `*_config.json` – exact `argparse` config used for that run.
- `*_eval_*.log` – per-eval result text (PubTables detection, structure cell-level, OCR, synth, DocLayNet, etc.).
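The slim checkpoint is a plain `torch.save` dict. A toy illustration of its layout (the keys `model`, `args`, and `vocab_size` are inferred from the loading example below; treat them as assumptions):

```python
import os
import tempfile
import torch

# Build and round-trip a toy checkpoint with the assumed slim layout.
ckpt = {
    "model": {"dummy.weight": torch.zeros(2, 2)},            # state_dict
    "args": {"dec_layers": 6, "dec_heads": 8, "max_len": 1024},
    "vocab_size": 512,
}
path = os.path.join(tempfile.mkdtemp(), "toy_step0_model_only.pt")
torch.save(ckpt, path)

ck = torch.load(path, map_location="cpu", weights_only=False)
print(sorted(ck))  # ['args', 'model', 'vocab_size']
```

Because no optimizer state is stored, these files are roughly half the size of a full training checkpoint, which is why they are the ones published here.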
## Loading

```python
import torch
from huggingface_hub import hf_hub_download
from model.tokenizer import TsumugiTokenizer
from model.model_dino import TsumugiModelDINO  # see the GitHub repo

ckpt_path = hf_hub_download(
    "Naoto-ipu/tsumugi-models",
    "dino_pub8/dino_pub8_step12000_model_only.pt",
)
ck = torch.load(ckpt_path, map_location="cuda", weights_only=False)
a = ck["args"]

tok = TsumugiTokenizer()
m = TsumugiModelDINO(
    vocab_size=tok.vocab_size,
    dec_layers=a["dec_layers"], num_heads=a["dec_heads"],
    max_seq_len=a["max_len"], dropout=0.0, use_roi=True,
    freeze_encoder=False,
    image_size=a.get("encoder_image_size") or 518,
).to("cuda")
m.load_state_dict(ck["model"])
m.eval()
```
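The snippet above hardcodes `"cuda"`. For CPU-only machines, a portable variant selects the device first and passes it to both `torch.load` and `.to()`:

```python
import torch

# Pick CUDA when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Then, in the loading snippet:
# ck = torch.load(ckpt_path, map_location=device, weights_only=False)
# m = TsumugiModelDINO(...).to(device)
```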
## Lineage

The naming convention reflects the chronological progression:

- `run1` → `run9`: from-scratch ViT, scaling experiments
- `roi_run1` → `roi_v3`: added ROI-pool conditioning, longer `max_len`, char fixes
- `dino_v1` → `dino_v3`: switched encoder to DINOv2 ViT-B/14, no-table training
- `dino_pub1` → `dino_pub4`: added PubTables-1M and DocLayNet real data
- `dino_pub5` → `dino_pub8`: added char-level OCR, real OCR at 686×686
See full per-run notes and metrics in the `*_eval_*.log` files inside
each subfolder, and in the tsumugi GitHub README.