---
language:
- ckb
- ar
- ur
license: cc-by-nc-4.0
tags:
- handwritten-text-recognition
- kurdish
- arabic
- urdu
- densenet
- transformer
- pytorch
- safetensors
datasets:
- DASTNUS
- KHATT
- PUCIT
metrics:
- cer
- wer
pipeline_tag: image-to-text
---
# KHLR: Kurdish Handwritten Line Recognition
**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**
This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.
---
## Repository Structure
```
KHLR/
├── Kurdish-HLR-Model/               # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/                # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/                  # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                     # Main training script
│   ├── synthetic_line_generator.py  # Recipe-based synthetic line generation
│   └── inference.py                 # Single image / batch inference
├── Sample/
│   ├── sample_image.tif             # Example handwritten line image
│   └── sample_image.txt             # Corresponding ground truth
├── requirements.txt
└── README.md
```
## Architecture
| Component | Details |
|-----------|---------|
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |
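As a rough sanity check on the table above, the size of the Transformer stack can be estimated from the listed hyperparameters alone. The sketch below assumes a standard Transformer layer layout (multi-head attention with biased Q/K/V/output projections, a two-layer feed-forward block, and layer normalization after each sub-layer); the actual layer definition in `Scripts/train.py` may differ, and the helper names are hypothetical.

```python
# Back-of-the-envelope parameter count for the Transformer stack.
# Assumption: standard layer layout; not taken from the repository's code.

def attention_params(d_model):
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)

def feed_forward_params(d_model, d_ff):
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model):
    # One scale and one shift vector.
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    # Self-attention + feed-forward + two layer norms.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))

def decoder_layer_params(d_model, d_ff):
    # Self-attention + cross-attention + feed-forward + three layer norms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))

# Values from the architecture table.
D_MODEL, D_FF, N_ENC, N_DEC = 256, 1024, 3, 3

transformer_params = (N_ENC * encoder_layer_params(D_MODEL, D_FF)
                      + N_DEC * decoder_layer_params(D_MODEL, D_FF))
print(f"Transformer stack: {transformer_params:,} parameters")
```

Under these assumptions the encoder and decoder layers account for roughly 5.5M parameters, which leaves the rest of the ~12.8M total to the DenseNet-121 backbone, embeddings, and projection layers. The number of attention heads does not change the count, since the heads split the same projection matrices.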
## Performance
### Kurdish (DASTNUS)
| Configuration | CER | WER | CRR (%) |
|--------------|-----|-----|---------|
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |
### Cross-Dataset Generalization
| Dataset | Language | CER | WER | CRR (%) |
|---------|----------|-----|-----|---------|
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
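The CRR column in the tables above equals `(1 - CER) × 100`. For readers who want to score their own predictions, the following is a minimal sketch of how CER, WER, and CRR are commonly computed from Levenshtein edit distance. It assumes corpus-level aggregation (total edits divided by total reference length); the evaluation code used for the reported numbers is not shown in this README and may normalize differently.

```python
# Minimal CER / WER / CRR sketch based on Levenshtein edit distance.
# Assumption: errors are aggregated over the whole test set before dividing.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, hypotheses):
    """Character error rate: character edits / reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)

def wer(references, hypotheses):
    """Word error rate: word edits / reference words (whitespace tokenization)."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / sum(len(r.split()) for r in references)

def crr(cer_value):
    """Character recognition rate in percent."""
    return round((1.0 - cer_value) * 100.0, 2)

refs = ["the quick brown fox"]
hyps = ["the quick brwn fox"]
print(cer(refs, hyps), wer(refs, hyps), crr(cer(refs, hyps)))
```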
## Installation
```bash
git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt
```
## Quick Start
### Inference
```bash
# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json
```
### Training
```bash
# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50
```
### Synthetic Line Generation
```bash
python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
```
## Models
| Model | Language | Vocabulary | Format |
|-------|----------|-----------|--------|
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |
The Arabic and Urdu models share a single unified vocabulary covering all three languages (Kurdish, Arabic, and Urdu), which enables cross-script transfer learning.
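One plausible way to build such a unified vocabulary is to start from the Kurdish token table and append only the characters that are new in the Arabic and Urdu transcriptions. The sketch below illustrates that idea. The special tokens, the `build_vocab` helper, and the sample strings are hypothetical, and the actual layout of `vocab.json` in this repository may differ.

```python
# Sketch: extend a base character vocabulary without changing existing ids.
# Assumption: character-level tokens plus a few special tokens (hypothetical).

SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(texts, base=None):
    """Return a token -> id map, appending unseen characters after `base`."""
    if base is None:
        vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    else:
        vocab = dict(base)  # copy so the base vocabulary is left untouched
    # Sort new characters so the result is deterministic.
    for ch in sorted({c for line in texts for c in line}):
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab

# Toy transcriptions standing in for the real datasets.
kurdish = build_vocab(["ڕۆژ باش"])
unified = build_vocab(["مرحبا", "شکریہ"], base=kurdish)

print(len(kurdish), len(unified))
```

Keeping the Kurdish ids fixed means that, in a setup like this, the embedding rows and output weights learned for Kurdish tokens could be reused when fine-tuning on another script, with only the rows for new characters initialized from scratch.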
## Dataset
The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:
| Data Source | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| **Total** | **7,849** | **655** | **649** |
The data used in this research is available upon request for non-commercial scientific research purposes only.
## Citation
```bibtex
[]
```
## License
This repository is released for **non-commercial scientific research purposes only**.