---
license: apache-2.0
language:
- zh
- en
tags:
- text-detection
- ocr
- dbnet
- repvit
- pytorch
datasets:
- chinese-text-detection
pipeline_tag: image-segmentation
---

# DBNet++ RepViT (Chinese)

A lightweight text detection model that pairs the DBNet++ detector with a RepViT backbone for efficient inference. The weights are pretrained on **Chinese text detection datasets**.

## Model Details

| Component | Configuration |
|-----------|--------------|
| Architecture | DBNet++ (Differentiable Binarization) |
| Backbone | RepViT (lightweight ViT-inspired CNN) |
| Neck | RSEFPN (in: [48, 96, 192, 384], out: 96) |
| Head | DBNetPPHead (inner: 24, k: 50) |
| Parameters | ~3M |
| Input Size | 640x640 (flexible) |
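
Although the input size is flexible, DBNet-style detectors usually expect height and width to be multiples of the network's total downsampling factor. The sketch below assumes a stride of 32 and a longest-side limit of 640; neither value is confirmed for this checkpoint, and `fit_to_stride` is a hypothetical helper, not part of any library.

```python
def fit_to_stride(height, width, limit_side=640, stride=32):
    """Pick a resize target whose sides are multiples of `stride`.

    Assumptions (not confirmed by the model card): the network downsamples
    by 32 overall and the longer image side is capped at 640 pixels.
    """
    # Only shrink images that exceed the limit; never upscale.
    scale = min(1.0, limit_side / max(height, width))
    # Round each scaled side to the nearest multiple of `stride`,
    # keeping at least one full stride per side.
    new_h = max(stride, int(round(height * scale / stride)) * stride)
    new_w = max(stride, int(round(width * scale / stride)) * stride)
    return new_h, new_w


print(fit_to_stride(640, 640))    # (640, 640)
print(fit_to_stride(1080, 1920))  # (352, 640)
```

Rounding to the nearest multiple slightly changes the aspect ratio; padding instead of resizing is an alternative if that distortion matters.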

## Training Data

The weights were converted from [OpenOCR](https://github.com/Topdu/OpenOCR) pretrained weights, which were trained on **Chinese text detection datasets**.

**Recommended datasets for fine-tuning:**
- MSRA-TD500 (Chinese + English)
- ICDAR2017 RCTW (Chinese)
- CTW1500

**Note:** For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.

## Usage

### With Hugging Face

```python
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="thisisiron/dbnetpp_repvit_ch",
    filename="dbnetpp_repvit_ch.pth"
)

# Load the raw state dict on CPU (this does not build the model)
state_dict = torch.load(model_path, map_location="cpu")
```

### With OCR-Factory

```python
import torch
from ocrfactory.models.detect import DBNetPP

# Build model
model = DBNetPP(
    backbone={"name": "RepViT"},
    neck={
        "name": "RSEFPN",
        "in_channels": [48, 96, 192, 384],
        "out_channels": 96,
        "shortcut": True
    },
    head={
        "name": "DBNetPPHead",
        "in_channels": 96,
        "inner_channels": 24,
        "k": 50,
        "use_asf": False
    }
)

# Load weights
state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=True)
model.eval()

# Inference
x = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    output = model(x)
    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
```
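
The `shrink_map` returned above is a per-pixel text probability map, so it still has to be turned into boxes. Real DBNet pipelines typically extract contours with OpenCV and expand ("unclip") each shrunk region; the pure-Python sketch below is a deliberately simplified stand-in that thresholds the map and returns axis-aligned boxes of 4-connected regions. The function name `boxes_from_prob_map` and the 0.3 threshold are assumptions for illustration, not values taken from OCR-Factory.

```python
from collections import deque


def boxes_from_prob_map(prob_map, threshold=0.3, min_area=1):
    """Return (x_min, y_min, x_max, y_max) boxes for 4-connected regions
    whose probability exceeds `threshold`. `prob_map` is a list of rows."""
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or prob_map[y][x] <= threshold:
                continue
            # Breadth-first flood fill over one connected text region.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and prob_map[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy 4x6 probability map with two separate high-probability regions.
toy = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.6, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(boxes_from_prob_map(toy))  # [(0, 0, 1, 1), (4, 1, 5, 2)]
```

With a PyTorch tensor, `shrink_map[0, 0].tolist()` yields the nested list this function expects. Box coordinates are in the resized input's pixel space and need rescaling to the original image.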

### Training Config (YAML)

```yaml
architecture:
  backbone:
    name: RepViT
  neck:
    name: RSEFPN
    in_channels: [48, 96, 192, 384]
    out_channels: 96
    shortcut: true
  head:
    name: DBNetPPHead
    in_channels: 96
    inner_channels: 24
    k: 50
    use_asf: false
```
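
The `k: 50` entry presumably corresponds to the amplification factor of the differentiable binarization step in the DBNet papers, which replaces a hard threshold with a steep sigmoid, `B = 1 / (1 + exp(-k * (P - T)))`, where `P` is the probability map and `T` the learned threshold map. A minimal single-pixel sketch of that formula follows; `db_binarize` is an illustrative name, and whether OCR-Factory's head uses `k` exactly this way is an assumption.

```python
import math


def db_binarize(p, t, k=50.0):
    """Approximate binarization of probability `p` against threshold `t`.

    Implements B = 1 / (1 + exp(-k * (p - t))) from the DBNet papers;
    a larger `k` makes the curve closer to a hard step at p == t.
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


# Well above, exactly at, and well below the threshold.
for p in (0.5, 0.3, 0.1):
    print(p, round(db_binarize(p, t=0.3), 5))
```

Because the function is smooth, gradients can flow through the binarization during training, which is the reason the papers give for using it instead of a fixed threshold.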

## Performance

| Dataset | Precision | Recall | H-mean |
|---------|-----------|--------|--------|
| MSRA-TD500 | - | - | - |

*Performance metrics will be updated after benchmarking.*
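
For reference, H-mean in the table is the harmonic mean of precision and recall (the F1 score). The sketch below shows the arithmetic; `detection_scores` is a hypothetical helper, and the counts are made up for illustration and are not results for this model.

```python
def detection_scores(true_positives, num_predictions, num_ground_truth):
    """Compute precision, recall, and H-mean from match counts."""
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    total = precision + recall
    # Harmonic mean; defined as 0 when both precision and recall are 0.
    h_mean = 2 * precision * recall / total if total else 0.0
    return precision, recall, h_mean


# Made-up counts: 8 correct detections out of 10 predictions,
# against 16 ground-truth text regions.
precision, recall, h_mean = detection_scores(8, 10, 16)
print(f"precision={precision:.3f} recall={recall:.3f} h_mean={h_mean:.3f}")
```

How a prediction counts as a true positive (for example, the IoU threshold) depends on the benchmark's evaluation protocol and is outside this sketch.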

## References

- **OpenOCR**: https://github.com/Topdu/OpenOCR
- **RepViT**: https://github.com/THU-MIG/RepViT
- **DBNet++**: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

## License

Apache 2.0