---
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
---

# DBNet++ RepViT (Chinese)

Lightweight text detection model combining DBNet++ with a RepViT backbone, optimized for efficient inference. Pretrained on **Chinese text detection datasets**.

## Model Details

| Component | Configuration |
|-----------|--------------|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in: [48, 96, 192, 384], out: 96) |
| Head | DBNetPPHead (inner: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |

## Training Data

The weights were converted from [OpenOCR](https://github.com/Topdu/OpenOCR) pretrained weights, which were trained on **Chinese text detection datasets**.

**Recommended datasets for fine-tuning:**
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500

**Note:** For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.
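
The Model Details table lists the input size as 640x640 but flexible. The sketch below shows one way to pick an input size for an arbitrary image, whether for inference or fine-tuning. It assumes the network downsamples by a total stride of 32, which is typical for four-stage backbones but is not confirmed by this card; `fit_input_size`, `max_side`, and `stride` are hypothetical names introduced for illustration only.

```python
def fit_input_size(height, width, max_side=640, stride=32):
    """Return (new_height, new_width) for resizing an image before detection.

    The longer side is scaled down to at most max_side (never scaled up),
    and both sides are rounded up to a multiple of stride.
    stride=32 is an assumption, not a value documented for this model.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")

    longest = max(height, width)
    # Scale factor expressed as an integer ratio to avoid float rounding.
    num, den = (max_side, longest) if longest > max_side else (1, 1)

    def scale_and_round(value):
        # Ceiling division: number of stride-sized blocks needed.
        blocks = -(-(value * num) // (den * stride))
        return blocks * stride

    return scale_and_round(height), scale_and_round(width)


print(fit_input_size(1080, 1920))  # a 1080p frame
print(fit_input_size(300, 500))    # a small image, only rounded up
```

Rounding up rather than down avoids shrinking small text further. If `max_side` is not itself a multiple of `stride`, the result can slightly exceed `max_side`.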

## Usage

### With Hugging Face

```python
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)

# Load weights
state_dict = torch.load(model_path, map_location="cpu")
```

### With OCR-Factory

```python
import torch
from ocrfactory.models.detect import DBNetPP

# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)

# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
```

### Training Config (YAML)

```yaml
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
```

## Performance

| Dataset | Precision | Recall | H-mean |
|---------|-----------|--------|--------|
| MSRA-TD500 | - | - | - |

*Performance metrics will be updated after benchmarking.*

## References

- **OpenOCR**: https://github.com/Topdu/OpenOCR
- **RepViT**: https://github.com/THU-MIG/RepViT
- **DBNet++**: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

## License

Apache 2.0