tsumugi — model checkpoints

Code: https://github.com/yanagizawa-naoto/tsumugi

Encoder–decoder model for table detection, structure recognition, and OCR. Encoder: DINOv2 ViT-B/14. Decoder: from-scratch autoregressive Transformer with ROI-pool conditioning injected at the <cls> decoder position.
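The ROI-pool conditioning idea can be sketched in a few lines. Everything below (the function name, the mean-pool choice, the grid geometry) is an illustrative assumption, not the actual tsumugi code: encoder patch tokens are pooled over a region of interest into a single vector that could be injected at the decoder's <cls> slot.

```python
import torch

def roi_pool_condition(patch_feats: torch.Tensor, box, grid_hw):
    """Pool encoder tokens over an ROI (illustrative sketch).

    patch_feats: (H*W, D) encoder patch tokens.
    box: (x0, y0, x1, y1) in patch-grid coordinates.
    grid_hw: (H, W) patch-grid shape.
    """
    h, w = grid_hw
    feats = patch_feats.view(h, w, -1)          # back to the 2-D patch grid
    x0, y0, x1, y1 = box
    roi = feats[y0:y1, x0:x1].reshape(-1, feats.shape[-1])
    return roi.mean(dim=0)                      # one conditioning vector

# ViT-B/14 at a 518 px input gives a 37x37 patch grid of 768-dim tokens.
feats = torch.randn(37 * 37, 768)
cond = roi_pool_condition(feats, (5, 5, 15, 12), (37, 37))
```

Mean pooling is just one reasonable choice here; the real model may pool differently or project the vector before injection.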

Key checkpoints:

| ckpt | best for | synth F1 | PubMed F1 | real OCR exact | size |
|------|----------|----------|-----------|----------------|------|
| dino_pub8/ | real OCR | — | 0.985 | 29.0% | 559 MB |
| dino_pub7/ | OCR @ 518 input | 0.992 | 0.983 | 21.1% | 559 MB |
| dino_pub6/ | detection IoU >= 0.9 | 0.993 | 0.989 | 14.1% | 559 MB |
| dino_pub5/ | first OCR-capable (synth only) | 0.996 | 0.979 | — | 558 MB |
| dino_pub4/ | multi-domain (DocLayNet F1 0.64) | 0.995 | 0.987 | — | 556 MB |
| dino_pub3/ | PubMed-best detection | 0.993 | 0.992 | — | 556 MB |
| dino_pub2/ | + Structure cells | 0.990 | 0.991 | — | 556 MB |
| dino_pub1/ | first real-tables detector | 0.991 | 0.979 | — | 556 MB |
| dino_v2/ | DINOv2 unfrozen, synth-only | 0.990 | 0.271 | — | 556 MB |
| dino_v1/ | DINOv2 frozen baseline | 0.987 | — | — | 556 MB |
| roi_v3/ | pre-DINOv2, from-scratch ViT | 0.986 | — | — | 396 MB |

roi_v2/, roi_run1/, run9/, run1/: earlier from-scratch lineage.

Each `<run>/` folder contains:

  • *_step*_model_only.pt β€” slim checkpoint (model weights, args, vocab_size; no optimizer state). Loadable for eval/inference but not for resume-training.
  • *_config.json β€” exact argparse config used for that run.
  • *_eval_*.log β€” per-eval result text (PubTables Detection, Structure cell-level, OCR, synth, DocLayNet, etc.).

Loading

```python
import torch
from huggingface_hub import hf_hub_download
from model.tokenizer import TsumugiTokenizer
from model.model_dino import TsumugiModelDINO  # see the GitHub repo

# Fetch a slim checkpoint from the Hub.
ckpt_path = hf_hub_download(
    "Naoto-ipu/tsumugi-models",
    "dino_pub8/dino_pub8_step12000_model_only.pt",
)
# weights_only=False: the checkpoint stores the args dict alongside the weights.
ck = torch.load(ckpt_path, map_location="cuda", weights_only=False)
a = ck["args"]

# Rebuild the model with the exact config the run was trained with.
tok = TsumugiTokenizer()
m = TsumugiModelDINO(
    vocab_size=tok.vocab_size,
    dec_layers=a["dec_layers"], num_heads=a["dec_heads"],
    max_seq_len=a["max_len"], dropout=0.0, use_roi=True,
    freeze_encoder=False,
    image_size=a.get("encoder_image_size") or 518,
).to("cuda")
m.load_state_dict(ck["model"])
m.eval()
```
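The slim "model_only" layout can be mimicked offline with a toy module to see why these checkpoints load for eval but cannot resume training. The layer and arg keys below are stand-ins, assuming only that the checkpoint is a plain dict with `"model"` and `"args"` entries as the snippet above uses:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a "*_model_only.pt" checkpoint: weights + args,
# deliberately no optimizer state.
net = nn.Linear(4, 2)
ckpt = {
    "model": net.state_dict(),
    "args": {"dec_layers": 4, "dec_heads": 8, "max_len": 512},
    "vocab_size": 100,
}
path = os.path.join(tempfile.mkdtemp(), "toy_step1_model_only.pt")
torch.save(ckpt, path)

# weights_only=False because the file holds a plain args dict,
# not just tensors.
ck = torch.load(path, map_location="cpu", weights_only=False)
fresh = nn.Linear(4, 2)
fresh.load_state_dict(ck["model"])  # enough for eval/inference
# There is no ck["optimizer"], so an exact training resume is impossible.
```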

Lineage

The naming convention reflects the chronological progression:

  • run1 β†’ run9: from-scratch ViT, scaling experiments
  • roi_run1 β†’ roi_v3: added ROI-pool conditioning, max_len, char fixes
  • dino_v1 β†’ dino_v3: switched encoder to DINOv2 ViT-B/14, no-table training
  • dino_pub1 β†’ dino_pub4: added PubTables-1M and DocLayNet real data
  • dino_pub5 β†’ dino_pub8: added char-level OCR, real OCR at 686Γ—686

See full per-run notes and metrics in the corresponding *_eval_*.log files inside each subfolder, and the tsumugi GitHub README.
