---
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
---

# DBNet++ RepViT (Chinese)

A lightweight text detection model that pairs the DBNet++ detector with a RepViT backbone for efficient inference. The weights were pretrained on **Chinese text detection datasets**.

## Model Details

| Component | Configuration |
|-----------|---------------|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in_channels: [48, 96, 192, 384], out_channels: 96) |
| Head | DBNetPPHead (inner_channels: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |

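The `k: 50` entry in the head configuration is presumably the amplification factor of the differentiable binarization step described in the DBNet papers, which replaces a hard threshold with a steep sigmoid, B = 1 / (1 + exp(-k (P - T))). The sketch below illustrates that formula in plain Python; the function name `approx_binarize` is hypothetical and is not part of OpenOCR or OCR-Factory.

```python
import math


def approx_binarize(prob: float, thresh: float, k: float = 50.0) -> float:
    """Differentiable binarization: a steep sigmoid centered on the threshold."""
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


# A pixel exactly at its threshold maps to 0.5; with k = 50 the output
# saturates quickly on either side of the threshold.
print(approx_binarize(0.3, 0.3))  # 0.5
print(approx_binarize(0.6, 0.3))  # very close to 1
print(approx_binarize(0.1, 0.3))  # very close to 0
```

A larger `k` makes the curve closer to a hard step while keeping it differentiable for training.
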
## Training Data

The weights were converted from pretrained [OpenOCR](https://github.com/Topdu/OpenOCR) weights, which were trained on **Chinese text detection datasets**.

**Recommended datasets for fine-tuning:**
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500

**Note:** For English-only text detection, fine-tune on English datasets such as ICDAR2015 or Total-Text.

## Usage

### With Hugging Face

```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub (cached locally after the first call)
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)

# Load the weights onto the CPU
state_dict = torch.load(model_path, map_location="cpu")
```

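Converted checkpoints are sometimes saved inside a wrapper dictionary, or with a `module.` prefix left over from `DataParallel` training. This card does not say whether either applies here, so the helper below is only a defensive sketch. The name `unwrap_state_dict`, the `"state_dict"` wrapper key, and the parameter names in the demo are all assumptions, not part of any library API.

```python
from collections import OrderedDict
from typing import Any, Mapping


def unwrap_state_dict(checkpoint: Mapping[str, Any]) -> "OrderedDict[str, Any]":
    """Return a flat parameter mapping with any 'module.' prefix stripped."""
    # Assumption: a wrapped checkpoint keeps its weights under "state_dict".
    inner = checkpoint.get("state_dict", checkpoint)
    cleaned = OrderedDict()
    for name, value in inner.items():
        # Drop the prefix that torch.nn.DataParallel adds to parameter names.
        key = name[len("module."):] if name.startswith("module.") else name
        cleaned[key] = value
    return cleaned


# Plain values stand in for tensors so the sketch runs without PyTorch;
# the parameter names are placeholders, not keys from this checkpoint.
demo = {"state_dict": {"module.backbone.weight": 1, "head.bias": 2}}
print(list(unwrap_state_dict(demo)))  # ['backbone.weight', 'head.bias']
```

If loading with `strict=True` fails on unexpected key names, passing the loaded object through a helper like this before `load_state_dict` is one thing to try.
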
### With OCR-Factory

```python
import torch
from ocrfactory.models.detect import DBNetPP

# Build the model with the same configuration as the checkpoint
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)

# Load the weights and switch to inference mode
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Run inference on a dummy input
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
```

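The input size is listed as flexible, and the example above uses a 640x640 tensor. For real images, a common DBNet convention is to cap the longer side and round both sides to a multiple of 32 so the strided feature maps line up. Whether this model strictly requires that is an assumption here, and `fit_size` is a hypothetical helper rather than an OCR-Factory function.

```python
from typing import Tuple


def fit_size(height: int, width: int, limit: int = 640, multiple: int = 32) -> Tuple[int, int]:
    """Scale (height, width) so the longer side is at most `limit`,
    then round each side to the nearest multiple of `multiple`."""
    scale = min(1.0, limit / max(height, width))
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    return new_h, new_w


print(fit_size(1080, 1920))  # (352, 640)
print(fit_size(480, 500))    # (480, 512)
```

The returned size can be passed to whatever resize routine the preprocessing pipeline already uses. Rounding changes the aspect ratio slightly, so predicted boxes need to be scaled back per axis.
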
### Training Config (YAML)

```yaml
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
```

## Performance

| Dataset | Precision | Recall | H-mean |
|---------|-----------|--------|--------|
| MSRA-TD500 | - | - | - |

*Performance metrics will be updated after benchmarking.*

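For reference, H-mean in the table is the harmonic mean of precision and recall (the F1 score). The sketch below shows how the three columns relate once detections have been matched to ground-truth boxes. The function name `detection_scores` is hypothetical, and the counts are arbitrary illustration values, not results for this model.

```python
def detection_scores(true_positives: int, num_detections: int, num_ground_truth: int):
    """Return (precision, recall, h_mean) from matched-box counts."""
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    total = precision + recall
    h_mean = 2 * precision * recall / total if total else 0.0
    return precision, recall, h_mean


# Arbitrary counts for illustration only; these are not results for this model.
p, r, h = detection_scores(true_positives=8, num_detections=10, num_ground_truth=16)
print(f"{p:.2f} {r:.2f} {h:.3f}")  # 0.80 0.50 0.615
```
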
## References

- **OpenOCR**: https://github.com/Topdu/OpenOCR
- **RepViT**: https://github.com/THU-MIG/RepViT
- **DBNet++**: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

## License

Apache 2.0