---
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
---

# DBNet++ RepViT (Chinese)

A lightweight text detection model that combines DBNet++ with a RepViT backbone for efficient inference. Pretrained on **Chinese text detection datasets**.

## Model Details

| Component | Configuration |
|-----------|---------------|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in: [48, 96, 192, 384], out: 96) |
| Head | DBNetPPHead (inner: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |

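One way to read the "flexible" input size is that any resolution works as long as the network's downsampling stages divide it evenly. The sketch below assumes a total stride of 32 (typical for four-stage backbones, but not confirmed for this checkpoint) and uses the 640-pixel figure from the table as a long-side limit. `fit_input_size` is a hypothetical helper, not part of OCR-Factory.

```python
def fit_input_size(height, width, max_side=640, stride=32):
    """Scale (height, width) so the longer side is at most max_side,
    then round each side to the nearest multiple of stride.

    stride=32 is an assumption about the backbone's total downsampling.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Only shrink; images already within the limit keep their scale.
    scale = min(1.0, max_side / max(height, width))

    def snap(side):
        # Round to the nearest multiple of stride, never below one stride.
        return max(stride, int(round(side * scale / stride)) * stride)

    return snap(height), snap(width)


print(fit_input_size(720, 1280))  # (352, 640)
print(fit_input_size(640, 640))   # (640, 640)
```

Rounding changes the aspect ratio slightly, so predicted coordinates need to be scaled back per axis rather than by a single factor.
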
## Training Data

This model was converted from [OpenOCR](https://github.com/Topdu/OpenOCR) pretrained weights that were trained on **Chinese text detection datasets**.

**Recommended datasets for fine-tuning:**
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500

**Note:** For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.

## Usage

### With Hugging Face

```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth",
)

# Load the weights onto the CPU
state_dict = torch.load(model_path, map_location="cpu")
```

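As a quick sanity check, the loaded `state_dict` can be compared against the ~3M parameter figure in the model details. This is a minimal sketch, assuming the checkpoint is a flat mapping from names to tensors. `count_elements` is a hypothetical helper, and the total includes buffers such as BatchNorm running statistics, so it may come out slightly above the trainable parameter count.

```python
import math
from types import SimpleNamespace


def count_elements(state_dict):
    """Sum the number of elements over every entry that exposes a .shape.

    Works with torch tensors (torch.Size is a tuple of ints) and skips
    non-tensor entries such as metadata.
    """
    total = 0
    for value in state_dict.values():
        shape = getattr(value, "shape", None)
        if shape is not None:
            total += math.prod(shape)
    return total


# Stand-in for a real checkpoint so the example runs without torch.
fake_state_dict = {
    "conv.weight": SimpleNamespace(shape=(48, 3, 3, 3)),
    "conv.bias": SimpleNamespace(shape=(48,)),
    "bn.num_batches_tracked": SimpleNamespace(shape=()),
}
print(count_elements(fake_state_dict))  # 48*3*3*3 + 48 + 1 = 1345
```

With the real checkpoint, `count_elements(state_dict) / 1e6` gives the size in millions.
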
### With OCR-Factory

```python
import torch
from ocrfactory.models.detect import DBNetPP

# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True,
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False,
    },
)

# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
```

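The `shrink_map` is a per-pixel text probability, so it still has to be turned into boxes. The sketch below is a simplified stand-in for the usual DBNet post-processing, which traces contours and expands them with an unclip step. It only thresholds the map and returns axis-aligned bounding boxes of 4-connected regions. The 0.3 threshold and the `boxes_from_prob_map` name are assumptions, and the input is a plain 2D list, for example `shrink_map[0, 0].tolist()`.

```python
from collections import deque


def boxes_from_prob_map(prob_map, threshold=0.3, min_pixels=1):
    """Return (x_min, y_min, x_max, y_max) boxes, one per 4-connected
    region whose probability exceeds threshold."""
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or prob_map[y0][x0] <= threshold:
                continue
            # Breadth-first flood fill over the thresholded region.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            pixels = 0
            while queue:
                y, x = queue.popleft()
                pixels += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and prob_map[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if pixels >= min_pixels:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes


toy_map = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.6],
]
print(boxes_from_prob_map(toy_map))  # [(0, 0, 1, 0), (3, 1, 3, 2)]
```

Because DBNet is trained on shrunk text regions, boxes taken directly from the map tend to be smaller than the actual text, which is why full pipelines expand them afterwards.
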
### Training Config (YAML)

```yaml
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
```

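The head's `k: 50` is presumably the amplification factor of the differentiable binarization step described in the DBNet papers, B = 1 / (1 + exp(-k (P - T))), where P is the probability map and T the learned threshold map. A minimal per-pixel sketch in plain Python follows. `differentiable_binarization` is an illustrative name, not an OCR-Factory function.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Approximate binary value for probability p and threshold t.

    Implements B = 1 / (1 + exp(-k * (p - t))), a steep sigmoid that
    approximates the hard step p > t while staying differentiable.
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


# Values just below, at, and just above the threshold.
for p in (0.2, 0.3, 0.4):
    print(p, round(differentiable_binarization(p, t=0.3), 4))
```

With k = 50, a probability only 0.1 away from the threshold is already pushed to within about 0.01 of 0 or 1, which is what lets the approximate binary map be supervised during training.
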
## Performance

| Dataset | Precision | Recall | H-mean |
|---------|-----------|--------|--------|
| MSRA-TD500 | - | - | - |

*Performance metrics will be updated after benchmarking.*

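For reading the table once it is filled in: H-mean is the harmonic mean of precision and recall (the F1 score). The sketch below uses made-up counts, not measurements from this model, and `detection_metrics` is a hypothetical helper.

```python
def detection_metrics(true_positives, num_predictions, num_ground_truth):
    """Return (precision, recall, h_mean) from matched-box counts."""
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    h_mean = 2 * precision * recall / (precision + recall)
    return precision, recall, h_mean


# Illustrative counts only: 80 correct boxes out of 100 predicted,
# with 160 ground-truth boxes in total.
precision, recall, h_mean = detection_metrics(80, 100, 160)
print(f"precision={precision:.3f} recall={recall:.3f} h_mean={h_mean:.3f}")
```
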
## References

- **OpenOCR**: https://github.com/Topdu/OpenOCR
- **RepViT**: https://github.com/THU-MIG/RepViT
- **DBNet++**: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

## License

Apache 2.0