metadata
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
DBNet++ RepViT (Chinese)
A lightweight text detection model that combines DBNet++ with a RepViT backbone for efficient inference. The weights are pretrained on Chinese text detection datasets.
Model Details
| Component | Configuration |
|---|---|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in: [48, 96, 192, 384], out: 96) |
| Head | DBNetPPHead (inner: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |
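Because the input size is flexible, images other than 640x640 can be resized before inference. The helper below is a minimal sketch of one way to choose a target size. It assumes that both sides should be multiples of 32, which is a common requirement for four-stage backbones but is not stated in this card. The function name and the `max_side` and `multiple` parameters are hypothetical.

```python
def resize_to_multiple(height, width, max_side=640, multiple=32):
    """Pick a target (height, width) for a detector input.

    The long side is scaled down to at most `max_side` (never up), then both
    sides are rounded to the nearest multiple of `multiple`.
    The multiple of 32 is an assumption, not a documented requirement.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Only shrink: small images keep their scale.
    scale = min(1.0, max_side / max(height, width))
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    return new_h, new_w
```

For example, a 1280x720 (height x width) image maps to 640x352 under these assumptions. Rounding changes the aspect ratio slightly, so predicted coordinates should be scaled back per axis.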
Training Data
This model was converted from OpenOCR pretrained weights, trained on Chinese text detection datasets.
Recommended datasets for fine-tuning:
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500
Note: For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.
Usage
With Hugging Face
from huggingface_hub import hf_hub_download
import torch
# Download model
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)
# Load weights
state_dict = torch.load(model_path, map_location="cpu")
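As a quick sanity check after loading, the entries of the state dict can be summed and compared against the approximate 3M parameter figure listed under Model Details. This is a minimal sketch: `count_parameters` is a hypothetical helper, and it assumes the checkpoint is a flat mapping from names to tensors rather than a dict that nests the weights under another key.

```python
import math


def count_parameters(state_dict):
    """Sum the element counts of every entry in a state-dict-like mapping.

    Works with any values that expose a `.shape` attribute, such as
    torch tensors. A scalar entry (empty shape) counts as one element.
    """
    return sum(math.prod(tensor.shape) for tensor in state_dict.values())
```

Calling `count_parameters(state_dict)` on the loaded weights should return a value in the low millions. State dicts also store buffers such as batch-norm running statistics, so the total may be slightly higher than the number of trainable parameters.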
With OCR-Factory
import torch
from ocrfactory.models.detect import DBNetPP
# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)
# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()
# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
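The shrink map still has to be turned into text regions. The sketch below shows a simplified version of that step in plain Python: it thresholds the map and returns one axis-aligned box per connected region. It assumes the map holds probabilities in [0, 1]. The function name, the 0.3 threshold, and `min_area` are illustrative choices, not values taken from this model. Standard DB post-processing instead extracts contours and expands ("unclips") each polygon, because the shrink map covers a shrunk version of each text region, so the boxes here will be tighter than the actual text.

```python
from collections import deque


def boxes_from_prob_map(prob_map, threshold=0.3, min_area=4):
    """Threshold a 2D probability map and return axis-aligned boxes.

    prob_map: 2D sequence (rows of floats).
    Returns (x_min, y_min, x_max, y_max) tuples in inclusive pixel
    coordinates, in row-major order of each region's first pixel.
    """
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if visited[y0][x0] or prob_map[y0][x0] <= threshold:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            queue = deque([(y0, x0)])
            visited[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not visited[ny][nx]
                            and prob_map[ny][nx] > threshold):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            # Drop tiny regions, which are usually noise.
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

With the output above, it could be called as `boxes_from_prob_map(shrink_map[0, 0].tolist())`. For real workloads, a contour-based implementation such as OpenCV's would be much faster than this pure-Python loop.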
Training Config (YAML)
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
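The `k: 50` entry in the head is assumed here to be the amplifying factor from the differentiable binarization formulation, where the approximate binary map is computed as 1 / (1 + exp(-k (P - T))) from the probability map P and the threshold map T. The sketch below evaluates that function for a single pixel to show how steep the step becomes at k = 50. The function name is hypothetical, and this is not the head's actual implementation.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Approximate step function used during DB training.

    p: probability map value in [0, 1]
    t: threshold map value in [0, 1]
    k: amplifying factor (50 in the config above)
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))
```

With k = 50, a probability only 0.1 above the threshold already maps to roughly 0.99, and one 0.1 below maps to roughly 0.01, so the output behaves like a hard threshold while remaining differentiable.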
Performance
| Dataset | Precision | Recall | H-mean |
|---|---|---|---|
| MSRA-TD500 | - | - | - |
Performance metrics will be updated after benchmarking.
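For reference, the three columns in the table are related as follows: H-mean is the harmonic mean of precision and recall. The helper below is a minimal sketch of that arithmetic. The function name and argument names are hypothetical, and it assumes that predictions have already been matched to ground-truth regions (for example by an IoU criterion).

```python
def detection_metrics(num_matched, num_predicted, num_ground_truth):
    """Return (precision, recall, h_mean) from detection match counts.

    num_matched: predictions matched to a ground-truth region
    num_predicted: total predicted regions
    num_ground_truth: total ground-truth regions
    """
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    # Harmonic mean; defined as 0 when both precision and recall are 0.
    h_mean = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, h_mean
```

The counts passed to this function would come from an evaluation script; no benchmark numbers for this model are implied here.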
References
- OpenOCR: https://github.com/Topdu/OpenOCR
- RepViT: https://github.com/THU-MIG/RepViT
- DBNet++: Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion
License
Apache 2.0