	---
	license: apache-2.0
	language:
	- en
	- vi
	pipeline_tag: image-to-text
	model-index:
	- name: HTR-ConvText
	  results:
	  - task:
	      type: image-to-text
	      name: Handwritten Text Recognition
	    dataset:
	      name: IAM
	      type: iam
	      split: test
	    metrics:
	    - type: cer
	      value: 4.0
	      name: Test CER
	    - type: wer
	      value: 12.9
	      name: Test WER
	  - task:
	      type: image-to-text
	      name: Handwritten Text Recognition
	    dataset:
	      name: LAM
	      type: lam
	      split: test
	    metrics:
	    - type: cer
	      value: 2.7
	      name: Test CER
	    - type: wer
	      value: 7.0
	      name: Test WER
	  - task:
	      type: image-to-text
	      name: Handwritten Text Recognition
	    dataset:
	      name: READ2016
	      type: read2016
	      split: test
	    metrics:
	    - type: cer
	      value: 3.6
	      name: Test CER
	    - type: wer
	      value: 15.7
	      name: Test WER
	  - task:
	      type: image-to-text
	      name: Handwritten Text Recognition
	    dataset:
	      name: HANDS-VNOnDB
	      type: hands-vnondb
	      split: test
	    metrics:
	    - type: cer
	      value: 3.45
	      name: Test CER
	    - type: wer
	      value: 8.9
	      name: Test WER
	---
	# HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition

	<div align="center"> <img src="image/architecture.png" alt="HTR-ConvText Architecture" width="800"/> </div>

	<p align="center">
	<a href="https://huggingface.co/DAIR-Group/HTR-ConvText">
	<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
	</a>
	<a href="https://github.com/DAIR-Group/HTR-ConvText">
	<img alt="GitHub" src="https://img.shields.io/badge/GitHub-Repo-181717.svg?logo=github&logoColor=white">
	</a>
	<a href="https://github.com/DAIR-Group/HTR-ConvText/blob/main/LICENSE">
	<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green">
	</a>
	<a href="https://arxiv.org/abs/2512.05021">
	<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2512.05021-b31b1b.svg">
	</a>
	</p>

	## Highlights

	HTR-ConvText is a hybrid architecture for Handwritten Text Recognition (HTR) that balances local feature extraction with global contextual modeling. It is designed to address two limitations of existing approaches: the conditional independence assumption of CTC-based decoding and the large data requirements of Transformers. Its key features are:

	- Hybrid CNN-ViT Architecture: Integrates a ResNet backbone with MobileViT with Positional Encoding (MVP) blocks and Conditional Positional Encoding, so the model captures fine-grained stroke details while maintaining global spatial awareness.
	- Hierarchical ConvText Encoder: A U-Net-like encoder that interleaves Multi-Head Self-Attention with Depthwise Convolutions, modeling both long-range dependencies and local structural patterns.
	- Textual Context Module (TCM): A training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional independence weakness of CTC decoding, and because the module is removed at inference it adds no latency.
	- State-of-the-Art Performance: Outperforms the compared methods on IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), and performs particularly well in low-resource scenarios and on text with complex diacritics.

	## Model Overview

	HTR-ConvText configurations and specifications:

	| Feature             | Specification                                       |
	| ------------------- | --------------------------------------------------- |
	| Architecture Type   | Hybrid CNN + Vision Transformer (Encoder-Only)      |
	| Parameters          | ~65.9M                                              |
	| Backbone            | ResNet-18 + MobileViT w/ Positional Encoding (MVP)  |
	| Encoder Layers      | 8 ConvText Blocks (Hierarchical)                    |
	| Attention Heads     | 8                                                   |
	| Embedding Dimension | 512                                                 |
	| Image Input Size    | 512×64                                              |
	| Inference Strategy  | Standard CTC Decoding (TCM is removed at inference) |

	For more details, including ablation studies and theoretical proofs, please refer to our [Technical Report](https://arxiv.org/pdf/2512.05021).
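The "Standard CTC Decoding" entry in the table above can be illustrated with a minimal sketch of greedy (best-path) CTC decoding: take the most likely class per frame, collapse consecutive repeats, then drop blanks. The blank index, the toy alphabet, and the function names below are assumptions for illustration only and are not taken from the repository.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumption: class 0 is the CTC blank; the repository may use another index


def ctc_greedy_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]


def ids_to_text(ids: Sequence[int], alphabet: str) -> str:
    """Map class ids to characters; id k maps to alphabet[k - 1] since 0 is the blank."""
    return "".join(alphabet[i - 1] for i in ids)


# Hypothetical per-frame argmax output for a toy three-character alphabet.
alphabet = "cat"
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_greedy_decode(frames)
print(ids_to_text(decoded, alphabet))  # prints "cat"
```

Each frame is decoded independently of the others, which is the conditional independence weakness that the training-only TCM is meant to compensate for.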

	## Performance

	We evaluated HTR-ConvText on four datasets covering four languages. The model achieves the lowest Character Error Rate (CER) and Word Error Rate (WER) among the compared methods without requiring large-scale synthetic pre-training. The table below lists CER (%); lower is better.

	| Dataset   | Language    | HTR-ConvText CER (%) | HTR-VT CER (%) | OrigamiNet CER (%) | TrOCR CER (%) | CRNN CER (%) |
	|-----------|-------------|----------------------|----------------|--------------------|---------------|--------------|
	| IAM       | English     | 4.0                  | 4.7            | 4.8                | 7.3           | 7.8          |
	| LAM       | Italian     | 2.7                  | 2.8            | 3.0                | 3.6           | 3.8          |
	| READ2016  | German      | 3.6                  | 3.9            | -                  | -             | 4.7          |
	| VNOnDB    | Vietnamese  | 3.45                 | 4.26           | 7.6                | -             | 10.53        |
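For readers unfamiliar with the metrics, the sketch below shows one common way to compute CER and WER: Levenshtein edit distance over characters (or words), divided by the reference length. Pooling errors over the whole test set, as done here, is an assumption; the repository's evaluation script may aggregate differently, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences, keeping only one row in memory."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rates(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (CER %, WER %) for (reference, hypothesis) pairs, pooled over all pairs."""
    char_err = char_len = word_err = word_len = 0
    for ref, hyp in pairs:
        char_err += edit_distance(ref, hyp)
        char_len += len(ref)
        word_err += edit_distance(ref.split(), hyp.split())
        word_len += len(ref.split())
    return 100.0 * char_err / char_len, 100.0 * word_err / word_len


# Toy example, not taken from any dataset.
pairs = [("the quick fox", "the quick fax"), ("hello world", "hello world")]
cer, wer = error_rates(pairs)
print(f"CER {cer:.2f}%  WER {wer:.2f}%")  # prints "CER 4.17%  WER 20.00%"
```

A single wrong character costs one character error but a whole word error, which is why WER is consistently higher than CER in the reported results.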

	## Quickstart

	### Installation

	1. Clone the repository
	```cmd
	git clone https://github.com/0xk0ry/HTR-ConvText.git
	cd HTR-ConvText
	```
	2. Create and activate a Python 3.9+ Conda environment
	```cmd
	conda create -n htr-convtext python=3.9 -y
	conda activate htr-convtext
	```
	3. Install PyTorch using the wheel that matches your CUDA driver (swap the index for CPU-only builds):
	```cmd
	pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
	```
	4. Install the remaining project requirements (everything except PyTorch, which you already picked in step 3).
	```cmd
	pip install -r requirements.txt
	```

	The code was tested on Python 3.9 and PyTorch 2.9.1.
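After installing, a quick sanity check can confirm that the interpreter meets the Python 3.9+ requirement and that PyTorch is visible in the active environment. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of the repository.

```python
import importlib.util
import sys
from typing import Dict

MIN_PYTHON = (3, 9)  # minimum version named in the installation steps


def check_environment() -> Dict[str, bool]:
    """Report whether Python is recent enough and whether the packages can be found.

    find_spec only locates the packages; it does not import them.
    """
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torchvision_installed": importlib.util.find_spec("torchvision") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'yes' if ok else 'NO'}")
```

If `torch_installed` prints `NO`, the environment created in step 2 is probably not the active one.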

	### Data Preparation

	We provide split files (train.ln, val.ln, test.ln) for IAM, READ2016, LAM, and VNOnDB under data/. Organize your data as follows:

	```
	./data/iam/
	├── train.ln
	├── val.ln
	├── test.ln
	└── lines
	    ├── a01-000u-00.png
	    ├── a01-000u-00.txt
	    └── ...
	```
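The layout above can be checked before training with a short script that pairs every listed image with its transcription. The sketch assumes that each line of a `.ln` split file holds one image file name (for example `a01-000u-00.png`) and that the label lives in a `.txt` file with the same stem; this format is an assumption, so verify it against the provided split files. `load_split` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path
from typing import List, Tuple


def load_split(split_file: Path, lines_dir: Path) -> List[Tuple[Path, str]]:
    """Pair each image named in a split file with the text of its same-stem .txt file.

    Assumes one image file name per line in the split file (unverified format).
    """
    samples = []
    for raw in split_file.read_text(encoding="utf-8").splitlines():
        name = raw.strip()
        if not name:
            continue  # ignore blank lines
        image_path = lines_dir / name
        label_path = image_path.with_suffix(".txt")
        if not image_path.is_file() or not label_path.is_file():
            raise FileNotFoundError(f"missing image or label for {name}")
        label = label_path.read_text(encoding="utf-8").strip()
        samples.append((image_path, label))
    return samples


# Example call (paths follow the directory tree shown above):
# samples = load_split(Path("data/iam/train.ln"), Path("data/iam/lines"))
```

Failing early on a missing file is deliberate: a silent skip would shrink the training set without warning.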

	### Training

	We provide training scripts in the ./run/ directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:

	```
	# Using the provided script
	bash run/iam.sh

	# OR running directly via Python
	python train.py \
	    --use-wandb \
	    --dataset iam \
	    --tcm-enable \
	    --exp-name "htr-convtext-iam" \
	    --img-size 512 64 \
	    --train-bs 32 \
	    --val-bs 8 \
	    --data-path /path/to/iam/lines/ \
	    --train-data-list data/iam/train.ln \
	    --val-data-list data/iam/val.ln \
	    --test-data-list data/iam/test.ln \
	    --nb-cls 80
	```
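When training on a different dataset, the value passed to `--nb-cls` has to match its character set. The sketch below derives a class count from training transcriptions under the assumption that `--nb-cls` equals the number of distinct characters plus one CTC blank; this reading of the flag is not confirmed by the text above, so check it against the training script. The helper names are hypothetical.

```python
from typing import Iterable, List


def build_charset(transcriptions: Iterable[str]) -> List[str]:
    """Sorted list of the distinct characters found in the transcriptions."""
    return sorted(set("".join(transcriptions)))


def num_classes(charset: List[str]) -> int:
    """Charset size plus one class for the CTC blank (assumed meaning of --nb-cls)."""
    return len(charset) + 1


# Toy transcriptions, not taken from any dataset.
labels = ["a move", "to stop"]
charset = build_charset(labels)
print(charset, num_classes(charset))
```

In practice the charset should be built from the training split only, so that characters seen only at test time do not change the model's output size.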

	### Inference / Evaluation

	To evaluate a pre-trained checkpoint on the test set:

	```
	python test.py \
	    --resume ./checkpoints/best_CER.pth \
	    --dataset iam \
	    --img-size 512 64 \
	    --data-path /path/to/iam/lines/ \
	    --test-data-list data/iam/test.ln \
	    --nb-cls 80
	```

	## Citation

	If you find our work helpful, please cite our paper:

	```
	@misc{truc2025htrconvtex,
	    title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
	    author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
	    year={2025},
	    eprint={2512.05021},
	    archivePrefix={arXiv},
	    primaryClass={cs.CV},
	    url={https://arxiv.org/abs/2512.05021},
	}
	```

	## Acknowledgement

	This project is inspired by and adapted from [HTR-VT](https://github.com/Intellindust-AI-Lab/HTR-VT). We gratefully acknowledge the authors for their open-source contributions.