---
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
---
# DBNet++ RepViT (Chinese)
A lightweight text detection model that combines DBNet++ with a RepViT backbone for efficient inference. Pretrained on **Chinese text detection datasets**.
## Model Details
| Component | Configuration |
|-----------|--------------|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in: [48, 96, 192, 384], out: 96) |
| Head | DBNetPPHead (inner: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |
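
Because the input size is flexible, arbitrary images need a target size before inference. The helper below is a minimal sketch of one way to pick it. It assumes that each side should be a multiple of 32 (common for DBNet-style detectors, but not confirmed for this model) and that the longer side is capped at 640. The function names `round_to_multiple` and `detection_input_size` are hypothetical and are not part of OpenOCR or OCR-Factory.

```python
def round_to_multiple(value, base=32):
    """Round value to the nearest multiple of base, never below base."""
    return max(base, int(round(value / base)) * base)


def detection_input_size(height, width, max_side=640, base=32):
    """Pick a (height, width) for the detector input.

    Assumption: sides must be multiples of `base` (32 here). The image is
    only ever scaled down, so that its longer side is at most `max_side`.
    """
    scale = min(1.0, max_side / max(height, width))
    return (
        round_to_multiple(height * scale, base),
        round_to_multiple(width * scale, base),
    )


# A 1080x1920 frame is scaled down and snapped to the 32-pixel grid.
print(detection_input_size(1080, 1920))  # (352, 640)
```

Snapping to a grid slightly changes the aspect ratio, so detected coordinates should be scaled back to the original image using the per-axis ratios.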
## Training Data
The weights were converted from [OpenOCR](https://github.com/Topdu/OpenOCR) pretrained weights, which were trained on **Chinese text detection datasets**.
**Recommended datasets for fine-tuning:**
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500
**Note:** For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.
## Usage
### With Hugging Face
```python
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)

# Load weights
state_dict = torch.load(model_path, map_location="cpu")
```
### With OCR-Factory
```python
import torch
from ocrfactory.models.detect import DBNetPP

# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)

# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)

shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
```
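
The shrink map is a per-pixel text probability, so it still has to be turned into regions. The following is a dependency-free sketch of the simplest version of that step: binarize at a threshold and take the bounding box of each connected region. The function name `boxes_from_prob_map`, the `0.3` threshold, and the `min_area` filter are illustrative assumptions, not values taken from this model's configuration. Full DB post-processing typically also extracts contours and expands ("unclips") each region, which this sketch omits.

```python
from collections import deque


def boxes_from_prob_map(prob_map, thresh=0.3, min_area=4):
    """Return (x_min, y_min, x_max, y_max) boxes for regions above thresh.

    prob_map is a 2D nested list of probabilities. Regions are
    4-connected; regions with fewer than min_area pixels are dropped.
    """
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    visited = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if visited[y0][x0] or prob_map[y0][x0] <= thresh:
                continue
            # Breadth-first flood fill from this unvisited text pixel.
            queue = deque([(y0, x0)])
            visited[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not visited[ny][nx]
                            and prob_map[ny][nx] > thresh):
                        visited[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy 6x8 map with a single 2x3 high-probability blob.
toy = [[0.0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (1, 2, 3):
        toy[y][x] = 0.9
print(boxes_from_prob_map(toy))  # [(1, 2, 3, 3)]
```

With the model output above, `shrink_map[0, 0].tolist()` yields a nested list in this format. A pure-Python flood fill is slow at 640x640, so a production pipeline would more likely rely on an optimized contour or connected-component routine.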
### Training Config (YAML)
```yaml
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
```
## Performance
| Dataset | Precision | Recall | H-mean |
|---------|-----------|--------|--------|
| MSRA-TD500 | - | - | - |
*Performance metrics will be updated after benchmarking.*
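
For reference, H-mean is the harmonic mean of precision and recall (the same quantity as the F1 score). A minimal sketch of the calculation follows. The function name `h_mean` is hypothetical, and the input values are arbitrary placeholders, not measurements of this model.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder inputs for illustration only, not benchmark results.
print(round(h_mean(0.9, 0.6), 4))  # 0.72
```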
## References
- **OpenOCR**: https://github.com/Topdu/OpenOCR
- **RepViT**: https://github.com/THU-MIG/RepViT
- **DBNet++**: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)
## License
Apache 2.0