	---
	license: apache-2.0
	language:
	- zh
	- en
	tags:
	- text-detection
	- ocr
	- dbnet
	- repvit
	- pytorch
	datasets:
	- chinese-text-detection
	pipeline_tag: image-segmentation
	---

	# DBNet++ RepViT (Chinese)

	A lightweight text detection model that pairs the DBNet++ detector with a RepViT backbone for efficient inference. The weights are pretrained on Chinese text detection datasets.

	## Model Details

	| Component | Configuration |
	|-----------|---------------|
	| Architecture | DBNet++ (Differentiable Binarization) |
	| Backbone | RepViT (lightweight ViT-inspired CNN) |
	| Neck | RSEFPN (in_channels: [48, 96, 192, 384], out_channels: 96) |
	| Head | DBNetPPHead (inner_channels: 24, k: 50) |
	| Parameters | ~3M |
	| Input Size | 640x640 (flexible) |
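
The "flexible" input size suggests that resolutions other than 640x640 are accepted. A minimal sketch of choosing a target size for an arbitrary image, assuming (the card does not state this) that both sides should be multiples of 32, as is common for DBNet-style detectors; `fit_input_size`, `max_side`, and `stride` are hypothetical names used only for illustration:

```python
def fit_input_size(height, width, max_side=640, stride=32):
    """Pick a network input size for an image of the given size.

    The image is scaled down so its longer side is at most max_side
    (never scaled up), then each side is rounded to the nearest multiple
    of stride, with a minimum of one stride. If max_side is not itself a
    multiple of stride, rounding can slightly exceed it.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    scale = min(1.0, max_side / max(height, width))

    def snap(value):
        # Round the scaled side to the nearest stride multiple.
        return max(stride, int(round(value * scale / stride)) * stride)

    return snap(height), snap(width)


print(fit_input_size(1080, 1920))  # (352, 640)
```

The aspect ratio is only approximately preserved after rounding, so detected coordinates need to be rescaled per axis when mapped back to the original image.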

	## Training Data

	The weights were converted from [OpenOCR](https://github.com/Topdu/OpenOCR) pretrained weights, which were trained on Chinese text detection datasets.

	Recommended datasets for fine-tuning:
	- MSRA-TD500 (Chinese + English)
	- ICDAR2017 RCTW (Chinese)
	- CTW1500

	Note: For English-only text detection, fine-tuning on English datasets (ICDAR2015, Total-Text) is recommended.

	## Usage

	### With Hugging Face

	```python
	from huggingface_hub import hf_hub_download
	import torch

	# Download model
	model_path = hf_hub_download(
	    repo_id="thisisiron/dbnetpp_repvit_ch",
	    filename="dbnetpp_repvit_ch.pth",
	)

	# Load weights
	state_dict = torch.load(model_path, map_location="cpu")
	```
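
Once loaded, the state dict can be checked against the ~3M parameter figure in the table above. A minimal sketch, assuming the checkpoint is a flat mapping from names to tensors (as the `load_state_dict` call in the next section implies); `count_parameters` is a hypothetical helper, and the total includes buffers such as batch-norm running statistics, so it may differ slightly from a trainable-parameter count:

```python
import math


def count_parameters(state_dict, skip_suffixes=("num_batches_tracked",)):
    """Sum the element counts of all entries in a state-dict-like mapping.

    Each value only needs a .shape attribute, so torch tensors work as-is.
    Entries whose name ends with one of skip_suffixes are ignored.
    """
    total = 0
    for name, tensor in state_dict.items():
        if name.endswith(skip_suffixes):
            continue
        total += math.prod(tensor.shape)
    return total


# Usage with the state_dict loaded above:
# print(f"{count_parameters(state_dict) / 1e6:.2f}M elements")
```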

	### With OCR-Factory

	```python
	import torch
	from ocrfactory.models.detect import DBNetPP

	# Build model
	model = DBNetPP(
	    backbone={"name": "RepViT"},
	    neck={
	        "name": "RSEFPN",
	        "in_channels": [48, 96, 192, 384],
	        "out_channels": 96,
	        "shortcut": True,
	    },
	    head={
	        "name": "DBNetPPHead",
	        "in_channels": 96,
	        "inner_channels": 24,
	        "k": 50,
	        "use_asf": False,
	    },
	)

	# Load weights
	state_dict = torch.load("dbnetpp_repvit_ch.pth", map_location="cpu")
	model.load_state_dict(state_dict, strict=True)
	model.eval()

	# Inference on a dummy batch (N, C, H, W)
	x = torch.randn(1, 3, 640, 640)
	with torch.no_grad():
	    output = model(x)
	    shrink_map = output["shrink_map"]  # (1, 1, 640, 640)
	```
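
The `shrink_map` is a per-pixel map, so it still has to be turned into text boxes. DB-style pipelines typically threshold the map, extract connected regions, and then expand ("unclip") each region. The sketch below covers only thresholding and axis-aligned bounding boxes, in pure Python. It assumes the map holds probabilities in [0, 1]; the 0.3 threshold, the minimum-area filter, and the `boxes_from_prob_map` name are illustrative assumptions, not values from this model card, and the unclip step is omitted:

```python
from collections import deque


def boxes_from_prob_map(prob_map, threshold=0.3, min_area=4):
    """Return (x_min, y_min, x_max, y_max) boxes for 4-connected regions
    of pixels whose value exceeds threshold.

    prob_map is a 2D sequence of floats (rows of equal length).
    Regions with fewer than min_area pixels are dropped.
    """
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or prob_map[y0][x0] <= threshold:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and prob_map[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

With the tensor from the example above, the input would be `shrink_map[0, 0].tolist()`. For full-resolution maps, a contour routine such as OpenCV's `cv2.findContours` is a more practical choice than this pure-Python loop.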

	### Training Config (YAML)

	```yaml
	architecture:
	  backbone:
	    name: RepViT
	  neck:
	    name: RSEFPN
	    in_channels: [48, 96, 192, 384]
	    out_channels: 96
	    shortcut: true
	  head:
	    name: DBNetPPHead
	    in_channels: 96
	    inner_channels: 24
	    k: 50
	    use_asf: false
	```
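
The `k` value in the head config corresponds to the amplifying factor of the differentiable binarization step in the DBNet papers, where the approximate binary map is computed from the probability map P and the threshold map T as 1 / (1 + exp(-k (P - T))). A small sketch of that formula on scalars, assuming `DBNetPPHead` follows the papers' definition (the card does not show the head's implementation); `approx_binarize` is a hypothetical name:

```python
import math


def approx_binarize(p, t, k=50.0):
    """Differentiable binarization: a steep sigmoid of (p - t) scaled by k.

    p is a probability value and t a threshold value, both in [0, 1].
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


print(approx_binarize(0.5, 0.5))  # 0.5, exactly at the threshold
```

With k = 50, values only slightly above or below the threshold are pushed close to 1 or 0, which approximates hard binarization while keeping a usable gradient during training.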

	## Performance

	| Dataset | Precision | Recall | H-mean |
	|---------|-----------|--------|--------|
	| MSRA-TD500 | - | - | - |

	Performance metrics will be updated after benchmarking.
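
For reference, H-mean in the table is the harmonic mean of precision and recall (equivalent to the F1 score). A minimal sketch with a hypothetical `h_mean` helper; it does not report any result for this model:

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot score well on H-mean by maximizing only precision or only recall.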

	## References

	- OpenOCR: https://github.com/Topdu/OpenOCR
	- RepViT: https://github.com/THU-MIG/RepViT
	- DBNet++: [Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion](https://arxiv.org/abs/2202.10304)

	## License

	Apache 2.0