WATERec-Models: Strong Baseline for WordArt-Oriented Scene Text Recognition
WATERec is the strong scene text recognition (STR) baseline proposed in the paper "Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods" (ECCV 2026). It couples a NaViT-like RoPE-ViT encoder, which accepts inputs of arbitrary shape, with an autoregressive (AR) Transformer decoder. This removes the fixed-size input template that limits conventional STR models on highly irregular WordArt.
This repository hosts the trained model checkpoints.
- Paper (arXiv): https://arxiv.org/abs/2606.24484
- Code: https://github.com/YesianRohn/WATER
- Model code (OpenOCR-WATERec): https://github.com/YesianRohn/OpenOCR-WATERec
- Datasets (WATER-Data): https://huggingface.co/datasets/Yesianrohn/WATER-Data
Model Architecture
- Encoder: 6-layer Transformer with RoPE attention, accepting arbitrary aspect ratios. Inputs are rescaled (aspect-ratio preserving) so the number of 4×4 patch tokens lies in [64, 256]; tokens are projected to d=384 and arranged in row-major order.
- Decoder: 2 cross-attention AR Transformer layers, predicting characters one by one under cross-entropy loss. Max text length 25; character set of 94 tokens (digits, letters, common symbols).
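The decoder's character-by-character prediction can be illustrated with a greedy decoding loop. This is a minimal sketch, not the OpenOCR-WATERec implementation: `step_fn`, the BOS/EOS token ids, and the toy scorer are hypothetical stand-ins, and only the maximum length of 25 comes from the description above.

```python
from typing import Callable, List, Sequence

MAX_TEXT_LEN = 25  # maximum text length stated in the model description


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = MAX_TEXT_LEN,
) -> List[int]:
    """Greedy autoregressive decoding.

    step_fn is a hypothetical stand-in for the decoder: given the tokens
    emitted so far (starting with BOS), it returns one score per vocabulary
    entry for the next token.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)
        # Pick the highest-scoring token (first index wins on ties).
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop BOS


def make_toy_step(target: List[int], eos_id: int, vocab_size: int):
    """Toy scorer (illustration only): spells out `target`, then emits EOS."""
    def step(tokens: List[int]) -> List[float]:
        pos = len(tokens) - 1  # number of characters emitted so far
        want = target[pos] if pos < len(target) else eos_id
        return [1.0 if i == want else 0.0 for i in range(vocab_size)]
    return step
```

In the real model, the scores at each step would come from the decoder cross-attending to the encoder's patch tokens; the loop structure (feed back the prediction, stop at EOS or at the length cap) is the part this sketch is meant to show.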
This design preserves native aspect ratios, mitigates distortion from fixed-template resizing, and better adapts to curved / vertical / multi-oriented artistic layouts.
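The aspect-ratio-preserving rescaling can be sketched as follows. This card does not spell out the exact rule, so the code below is one plausible reading under stated assumptions: scale both sides uniformly until the patch count falls inside [64, 256], round each side to a multiple of the patch size, then nudge the longer side if rounding left the count out of range. The function name `target_size` is hypothetical; consult OpenOCR-WATERec for the actual preprocessing.

```python
import math
from typing import Tuple

PATCH = 4                           # 4x4 patches, as described above
MIN_TOKENS, MAX_TOKENS = 64, 256    # token budget, as described above


def target_size(
    height: int,
    width: int,
    patch: int = PATCH,
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> Tuple[int, int]:
    """Return (new_height, new_width), both multiples of `patch`, whose
    patch count lies in [min_tokens, max_tokens] (assumed rule)."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")

    tokens = (height / patch) * (width / patch)
    clamped = min(max(tokens, min_tokens), max_tokens)
    scale = math.sqrt(clamped / tokens)  # uniform scale keeps the aspect ratio

    rows = max(1, round(height * scale / patch))
    cols = max(1, round(width * scale / patch))

    # Rounding can leave the count slightly out of range; adjust the longer side.
    while rows * cols > max_tokens:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    while rows * cols < min_tokens:
        if cols >= rows:
            cols += 1
        else:
            rows += 1

    return rows * patch, cols * patch
```

Under this rule a 64×256 crop maps to 32×128 (8×32 = 256 tokens), while a 32×128 crop is already within budget and is left unchanged. With row-major ordering, the token at patch row `r` and column `c` would sit at index `r * cols + c`.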
Checkpoints
Each file is a standard PyTorch state_dict (~112 MB), differing only in the training data:
| File | Training data | WordArt-Bench Acc. |
|---|---|---|
| WATERec-R.pth | WATER-R (real only, 3.2M) | 88.55% |
| WATERec-S.pth | WATER-S (synthetic only, 2M) | 80.94% |
| WATERec-RS.pth | WATER-R + WATER-S (real + 2M synthetic) | 90.40% |
WATERec-RS.pth is the recommended best model: it is the first result to exceed 90% on WordArt-Bench, surpassing both general-purpose and OCR-specialized VLMs by a large margin.
Usage
We recommend running these checkpoints with the official framework OpenOCR-WATERec, which provides the matching model configuration, preprocessing, and inference scripts.
Download the weights:
# Requires: pip install -U "huggingface_hub[cli]"
hf download Yesianrohn/WATERec-Models --local-dir ./WATERec-Models
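If you only need one checkpoint, it can also be fetched from Python with `hf_hub_download` from the `huggingface_hub` package. This is a minimal sketch: the wrapper `download_checkpoint` and its defaults are illustrative, and the file names are taken from the Checkpoints table on the assumption that they sit at the repository root.

```python
REPO_ID = "Yesianrohn/WATERec-Models"
# File names from the Checkpoints table; assumed to be at the repo root.
CHECKPOINTS = ("WATERec-R.pth", "WATERec-S.pth", "WATERec-RS.pth")


def download_checkpoint(
    filename: str = "WATERec-RS.pth",
    local_dir: str = "./WATERec-Models",
) -> str:
    """Download a single checkpoint and return its local path."""
    if filename not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint {filename!r}; expected one of {CHECKPOINTS}")
    # Imported lazily so the module can be imported without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)
```

Calling `download_checkpoint()` requires network access and returns the path of the downloaded file, which can be passed to `torch.load`.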
Load a checkpoint:
import torch
# weights_only=True for safer loading of pickle-based .pth files
state_dict = torch.load("WATERec-RS.pth", map_location="cpu", weights_only=True)
# Build the WATERec model from the OpenOCR-WATERec config, then:
# model.load_state_dict(state_dict)
These .pth files contain only model weights; no config is bundled. Use the configs in the OpenOCR-WATERec repository to instantiate the architecture before loading the state dict.
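Before building the model, it can be useful to sanity-check what a checkpoint contains. The helper below is a sketch (the name `summarize_state_dict` is hypothetical) that counts parameters and estimates the in-memory size from each tensor's `numel()` and `element_size()`. It assumes every value in the state dict is a tensor.

```python
from typing import Mapping, Tuple


def summarize_state_dict(state_dict: Mapping[str, object]) -> Tuple[int, float]:
    """Return (total number of elements, approximate size in MB).

    Works with any mapping whose values expose numel() and element_size(),
    as torch.Tensor does.
    """
    total_params = 0
    total_bytes = 0
    for tensor in state_dict.values():
        n = tensor.numel()
        total_params += n
        total_bytes += n * tensor.element_size()
    return total_params, total_bytes / 1e6


# Example usage (requires torch and a downloaded checkpoint):
#   import torch
#   sd = torch.load("WATERec-RS.pth", map_location="cpu", weights_only=True)
#   n_params, size_mb = summarize_state_dict(sd)
#   print(f"{n_params:,} parameters, about {size_mb:.1f} MB")
```

If the weights are stored as float32 (an assumption, not stated in this card), a file of about 112 MB would correspond to roughly 28 million parameters, which gives a quick plausibility check on the reported numbers.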
License
Released under the Apache 2.0 license.
Citation
If you use these models in your research, please cite our paper:
@inproceedings{water2026eccv,
title = {Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods},
author = {Ye, Xingsong and Du, Yongkun and Zhang, Jiaxin and Zhang, Haojie and Sun, Chong and Li, Chen and Lyu, Jing and Chen, Zhineng},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}