---

language:
  - ckb
  - ar
  - ur
license: cc-by-nc-4.0
tags:
  - handwritten-text-recognition
  - kurdish
  - arabic
  - urdu
  - densenet
  - transformer
  - pytorch
  - safetensors
datasets:
  - DASTNUS
  - KHATT
  - PUCIT
metrics:
  - cer
  - wer
pipeline_tag: image-to-text
---


# KHLR: Kurdish Handwritten Line Recognition

**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**

This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, along with models fine-tuned on the Arabic (KHATT) and Urdu (PUCIT) handwritten datasets to evaluate cross-dataset generalization.

---

## Repository Structure

```
KHLR/
├── Kurdish-HLR-Model/      # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/       # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/         # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                     # Main training script
│   ├── synthetic_line_generator.py  # Recipe-based synthetic line generation
│   └── inference.py                 # Single image / batch inference
├── Sample/
│   ├── sample_image.tif    # Example handwritten line image
│   └── sample_image.txt    # Corresponding ground truth
├── requirements.txt
└── README.md
```

## Architecture

| Component | Details |
|-----------|---------|
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |
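
As a rough sanity check on the parameter total, the sketch below counts weights for the configuration in the table. It is a back-of-the-envelope estimate, not the authors' accounting: it assumes standard PyTorch-style Transformer layers (with biases and LayerNorm), a hypothetical linear projection from the 1024 DenseNet-121 feature channels to the hidden size, a token embedding and output layer over the 114-token Kurdish vocabulary, and 6,953,856 parameters for the DenseNet-121 feature layers without the ImageNet classifier.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    # The number of heads does not change the count; heads only split d_model.
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model: int, d_ff: int) -> int:
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    # One scale and one shift vector.
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + feed-forward + two LayerNorms.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + cross-attention + feed-forward + three LayerNorms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))


def estimate_total(d_model=256, d_ff=1024, n_enc=3, n_dec=3, vocab=114,
                   backbone=6_953_856, backbone_channels=1024) -> int:
    # ASSUMPTIONS: projection layer, embedding and output head are hypothetical;
    # `backbone` is the DenseNet-121 feature extractor without its classifier.
    projection = backbone_channels * d_model + d_model
    embedding = vocab * d_model
    output_head = d_model * vocab + vocab
    transformer = (n_enc * encoder_layer_params(d_model, d_ff)
                   + n_dec * decoder_layer_params(d_model, d_ff))
    return backbone + projection + embedding + output_head + transformer


if __name__ == "__main__":
    print(f"estimated parameters: {estimate_total() / 1e6:.2f}M")
```

Under these assumptions the estimate lands close to the ~12.8M in the table, with a little over half of the weights in the CNN backbone. The actual model may differ in details such as positional encodings or final normalization layers.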

## Performance

### Kurdish (DASTNUS)

| Configuration | CER | WER | CRR (%) |
|--------------|-----|-----|---------|
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

### Cross-Dataset Generalization

| Dataset | Language | CER | WER | CRR (%) |
|---------|----------|-----|-----|---------|
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
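
For reference, the metrics in these tables can be sketched as follows. CER and WER are edit-distance error rates at the character and word level, and the CRR column matches 100 × (1 − CER) in every row above. This is a minimal illustration, not the repository's evaluation code: whether the reported scores are averaged per line or pooled over the whole test set, and how the text is normalized beforehand, are not stated here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def crr(cer_value: float) -> float:
    """Character recognition rate in percent, as 100 * (1 - CER)."""
    return 100.0 * (1.0 - cer_value)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))                 # 3 edits over 6 characters
    print(wer("the cat sat", "the cat sat down"))   # 1 edit over 3 words
    print(round(crr(0.0593), 2))                    # first row of the Kurdish table
```

Note that this sketch counts Unicode code points as characters; for Arabic-script text with combining marks, a different unit of comparison would change the numbers slightly.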

## Installation

```bash
git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt
```

## Quick Start

### Inference

```bash
# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json
```

### Training

```bash
# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50
```

### Synthetic Line Generation

```bash
python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
```
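
Because the generator takes separate writer lists for each split, it can be worth checking that no writer appears in more than one list before generating lines. The helper below is a hypothetical sketch, not part of the repository: the function names are illustrative, and it assumes each writer file holds one writer ID per line, which is not documented here.

```python
from pathlib import Path


def read_writer_ids(path) -> set:
    """Read writer IDs from a text file (ASSUMED format: one ID per line)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def find_overlaps(splits: dict) -> dict:
    """Return {(split_a, split_b): shared_ids} for every pair sharing writers."""
    names = sorted(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


if __name__ == "__main__":
    # Toy in-memory example; with real files, build the sets with
    # read_writer_ids("./writers/Training.txt") and so on.
    splits = {
        "training": {"w01", "w02"},
        "validation": {"w03"},
        "testing": {"w02", "w04"},
    }
    for pair, shared in find_overlaps(splits).items():
        print(pair, sorted(shared))
```

An empty result means the three lists are disjoint, so synthetic lines built from training writers cannot leak handwriting styles into the validation or test sets.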

## Models

| Model | Language | Vocabulary | Format |
|-------|----------|-----------|--------|
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

The Arabic and Urdu models share a single unified vocabulary that covers all three languages (Kurdish + Arabic + Urdu), which enables cross-script transfer learning.
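
One way to picture a unified vocabulary is as a merge of per-language token tables in which the base language keeps its original ids and only unseen tokens are appended. The sketch below illustrates that idea on toy data. It is an assumption for illustration only: how the 192-token vocabulary was actually built, and the layout of `vocab.json`, are not documented here, and `build_unified_vocab` is a hypothetical helper.

```python
def build_unified_vocab(base_vocab: dict, *extra_vocabs: dict) -> dict:
    """Merge token->id maps, keeping base ids and appending unseen tokens."""
    unified = dict(base_vocab)
    next_id = max(unified.values(), default=-1) + 1
    for vocab in extra_vocabs:
        # Walk each extra vocabulary in its own id order for a stable result.
        for token in sorted(vocab, key=vocab.get):
            if token not in unified:
                unified[token] = next_id
                next_id += 1
    return unified


if __name__ == "__main__":
    # Toy vocabularies (ASSUMED token->id format, a handful of characters each).
    kurdish = {"<pad>": 0, "ا": 1, "ڕ": 2}
    arabic = {"<pad>": 0, "ا": 1, "ض": 2}
    urdu = {"<pad>": 0, "ٹ": 1, "ا": 2}

    unified = build_unified_vocab(kurdish, arabic, urdu)
    print(len(unified), unified)
```

Keeping the base ids stable means embedding rows learned for shared characters can be reused when fine-tuning on another language, while the appended tokens start from fresh rows.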

## Dataset

The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:

| Data Source | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| **Total** | **7,849** | **655** | **649** |

The data used in this research is available upon request for non-commercial scientific research purposes only.

## Citation

```bibtex

[]

```

## License

This repository is released under the CC BY-NC 4.0 license, for **non-commercial scientific research purposes only**.