	---
	language:
	- ckb
	- ar
	- ur
	license: cc-by-nc-4.0
	tags:
	- handwritten-text-recognition
	- kurdish
	- arabic
	- urdu
	- densenet
	- transformer
	- pytorch
	- safetensors
	datasets:
	- DASTNUS
	- KHATT
	- PUCIT
	metrics:
	- cer
	- wer
	pipeline_tag: image-to-text
	---

	# KHLR: Kurdish Handwritten Line Recognition

	A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation

	This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.

	---

	## Repository Structure

	```
	KHLR/
	├── Kurdish-HLR-Model/                # Best Kurdish model (safetensors + config)
	├── Arabic-HLR-Model/                 # Fine-tuned on KHATT Arabic dataset
	├── Urdu-HLR-Model/                   # Fine-tuned on PUCIT Urdu dataset
	├── Scripts/
	│   ├── train.py                      # Main training script
	│   ├── synthetic_line_generator.py   # Recipe-based synthetic line generation
	│   └── inference.py                  # Single image / batch inference
	├── Sample/
	│   ├── sample_image.tif              # Example handwritten line image
	│   └── sample_image.txt              # Corresponding ground truth
	├── requirements.txt
	└── README.md
	```

	## Architecture

	| Component | Details |
	|-----------|---------|
	| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
	| Encoder | 3 Transformer encoder layers |
	| Decoder | 3 Transformer decoder layers |
	| Attention Heads | 8 |
	| Hidden Size | 256 |
	| Feed-Forward Dim | 1024 |
	| Total Parameters | ~12.8M |
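
As a rough sanity check on the ~12.8M figure, the sketch below tallies parameters for a model with the dimensions in the table. Everything about the wiring is an assumption rather than something confirmed by this README: PyTorch-style Transformer layers with biases and LayerNorm, a linear 1024-to-256 projection after the DenseNet-121 feature extractor, a separate token embedding and output layer over the 114-token Kurdish vocabulary, and about 6.95M parameters for the DenseNet-121 features without the ImageNet classifier.

```python
# Back-of-the-envelope parameter count. The wiring is assumed,
# not read from Scripts/train.py.

D_MODEL = 256   # Hidden Size (Architecture table)
FF_DIM = 1024   # Feed-Forward Dim (Architecture table)
NUM_ENC = 3     # Transformer encoder layers
NUM_DEC = 3     # Transformer decoder layers
VOCAB = 114     # Kurdish vocabulary size (Models table)

# Assumption: DenseNet-121 feature extractor without its 1000-class head.
DENSENET_FEATURES = 6_953_856
DENSENET_CHANNELS = 1024


def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def attention(d):
    """Query, key, value, and output projections, each d x d with bias."""
    return 4 * linear(d, d)


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def encoder_layer(d, ff):
    return attention(d) + linear(d, ff) + linear(ff, d) + 2 * layer_norm(d)


def decoder_layer(d, ff):
    # Self-attention plus cross-attention, then the feed-forward block.
    return 2 * attention(d) + linear(d, ff) + linear(ff, d) + 3 * layer_norm(d)


transformer = (NUM_ENC * encoder_layer(D_MODEL, FF_DIM)
               + NUM_DEC * decoder_layer(D_MODEL, FF_DIM))
projection = linear(DENSENET_CHANNELS, D_MODEL)   # assumed CNN-to-Transformer bridge
embedding = VOCAB * D_MODEL                       # assumed decoder token embedding
output_head = linear(D_MODEL, VOCAB)              # assumed output classifier

total = DENSENET_FEATURES + projection + transformer + embedding + output_head
print(f"transformer layers: {transformer:,}")
print(f"estimated total:    {total / 1e6:.2f}M")
```

Under these assumptions the Transformer layers account for roughly 5.5M parameters and the backbone for most of the rest, which is in line with the reported total.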

	## Performance

	### Kurdish (DASTNUS)

	| Configuration | CER | WER | CRR (%) |
	|---------------|-----|-----|---------|
	| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
	| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

	### Cross-Dataset Generalization

	| Dataset | Language | CER | WER | CRR (%) |
	|---------|----------|-----|-----|---------|
	| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
	| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
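
The CRR column in both tables is consistent with CRR = 100 × (1 − CER). The following is a minimal sketch of how these metrics are commonly computed with Levenshtein distance; it is not taken from the repository's evaluation code, and the per-line normalization by reference length is an assumption (corpus-level scores are often obtained by summing edits and reference lengths over all lines instead).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def crr(cer_value):
    """Character recognition rate in percent."""
    return 100.0 * (1.0 - cer_value)


reference = "handwritten line"
hypothesis = "handwriten line"   # one missing character
print(cer(reference, hypothesis))   # 1 edit over 16 characters
print(wer(reference, hypothesis))   # 1 wrong word out of 2
print(round(crr(0.0593), 2))        # CER from the first Kurdish row
```

A single character error costs far more in WER than in CER because it invalidates the whole word, which is why the WER values in the tables are several times larger than the CER values.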

	## Installation

	```bash
	git clone https://huggingface.co/karez/KHLR
	cd KHLR
	pip install -r requirements.txt
	```

	## Quick Start

	### Inference

	```bash
	# Single image (using the safetensors checkpoint)
	python Scripts/inference.py \
	--image Sample/sample_image.tif \
	--model_path Kurdish-HLR-Model/model.safetensors \
	--vocab_path Kurdish-HLR-Model/vocab.json

	# Directory of images
	python Scripts/inference.py \
	--image_dir ./test_images \
	--model_path Kurdish-HLR-Model/model.safetensors \
	--vocab_path Kurdish-HLR-Model/vocab.json
	```
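
To run the same commands from Python, for example to switch between the three model directories, a small helper can assemble the argument list. This is only a sketch: `build_inference_command` is a hypothetical helper, it uses only the flags shown above, and it assumes the Arabic and Urdu directories use the same `model.safetensors` and `vocab.json` file names as the Kurdish one.

```python
import sys
from pathlib import Path


def build_inference_command(model_dir, image=None, image_dir=None):
    """Assemble the Scripts/inference.py command for one image or a directory."""
    if (image is None) == (image_dir is None):
        raise ValueError("pass exactly one of image or image_dir")
    model_dir = Path(model_dir)
    cmd = [sys.executable, "Scripts/inference.py"]
    if image is not None:
        cmd += ["--image", str(image)]
    else:
        cmd += ["--image_dir", str(image_dir)]
    cmd += [
        # Assumed file names, matching the Kurdish example above.
        "--model_path", (model_dir / "model.safetensors").as_posix(),
        "--vocab_path", (model_dir / "vocab.json").as_posix(),
    ]
    return cmd


cmd = build_inference_command("Kurdish-HLR-Model", image="Sample/sample_image.tif")
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.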

	### Training

	```bash
	# Basic training (unique handwritten lines only)
	python Scripts/train.py \
	--data_dir ./data/DASTNUS \
	--vocab_path Kurdish-HLR-Model/vocab.json

	# Full training with synthetic lines + writer mixing (best configuration)
	python Scripts/train.py \
	--data_dir ./data/DASTNUS \
	--vocab_path Kurdish-HLR-Model/vocab.json \
	--use_synthetic \
	--synthetic_dir ./data/Synthetic-Lines \
	--use_writer_mixing \
	--fixed_lines_dir ./data/Fixed-Lines \
	--num_writers 50
	```

	### Synthetic Line Generation

	```bash
	python Scripts/synthetic_line_generator.py \
	--unique_words_dir ./data/Unique-Words \
	--person_names_dir ./data/Person-Names \
	--output_dir ./data/Synthetic-Lines \
	--training_writers ./writers/Training.txt \
	--validation_writers ./writers/Validation.txt \
	--testing_writers ./writers/Testing.txt
	```

	## Models

	| Model | Language | Vocabulary | Format |
	|-------|----------|------------|--------|
	| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
	| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
	| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

	The Arabic and Urdu models share a unified 192-token vocabulary that covers all three languages (Kurdish, Arabic, and Urdu), which enables cross-script transfer learning.
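
One way to picture a unified vocabulary is as the union of the per-language character sets, with each shared character mapped to a single token ID. The sketch below illustrates that idea only: the special token names, the ordering, and the tiny character sets are hypothetical, and the actual layout of `vocab.json` may differ.

```python
SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # hypothetical names


def build_unified_vocab(*charsets):
    """Map every character seen in any language to one shared token ID."""
    chars = sorted(set().union(*charsets))
    tokens = SPECIAL_TOKENS + chars
    return {token: idx for idx, token in enumerate(tokens)}


# Tiny illustrative subsets, not the real alphabets.
kurdish = {"ئ", "ا", "و", "ە", "ڕ", "ۆ"}
arabic = {"ث", "ص", "ض", "ا", "و"}
urdu = {"ٹ", "ڈ", "ے", "ا", "و"}

vocab = build_unified_vocab(kurdish, arabic, urdu)
print(len(vocab))    # special tokens plus distinct characters
print(vocab["ا"])    # a shared letter has exactly one ID
```

Sharing IDs for characters common to the three languages is what would let a fine-tuned model reuse the embeddings learned for them during Kurdish training.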

	## Dataset

	The models were trained using the following subsets of the DASTNUS Kurdish handwritten dataset:

	| Data Source | Training | Validation | Testing |
	|-------------|----------|------------|---------|
	| Unique handwritten lines | 3,575 | 655 | 649 |
	| Synthetic handwritten lines | 3,762 | - | - |
	| Fixed-content lines (50 writers) | 512 | - | - |
	| Total | 7,849 | 655 | 649 |
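
The short calculation below uses only the counts from the table to show how the training set is composed.

```python
training_lines = {
    "Unique handwritten lines": 3575,
    "Synthetic handwritten lines": 3762,
    "Fixed-content lines (50 writers)": 512,
}

total = sum(training_lines.values())
for source, count in training_lines.items():
    print(f"{source}: {count} ({100 * count / total:.1f}%)")
print(f"Total: {total}")
```

Synthetic lines make up just under half of the training set (about 48%), while validation and testing use unique handwritten lines only.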

	The data used in this research is available upon request for non-commercial scientific research purposes only.

	## Citation

	```bibtex
	[]
	```

	## License

	This repository is released under the CC BY-NC 4.0 license, for non-commercial scientific research purposes only.