---

language:
  - ckb
  - ar
  - ur
license: cc-by-nc-4.0
tags:
  - handwritten-text-recognition
  - kurdish
  - arabic
  - urdu
  - densenet
  - transformer
  - pytorch
  - safetensors
datasets:
  - DASTNUS
  - KHATT
  - PUCIT
metrics:
  - cer
  - wer
pipeline_tag: image-to-text
---


# KHLR: Kurdish Handwritten Line Recognition

**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**

This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.

---

## Repository Structure

```
KHLR/
├── Kurdish-HLR-Model/               # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/                # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/                  # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                     # Main training script
│   ├── synthetic_line_generator.py  # Recipe-based synthetic line generation
│   └── inference.py                 # Single image / batch inference
├── Sample/
│   ├── sample_image.tif             # Example handwritten line image
│   └── sample_image.txt             # Corresponding ground truth
├── requirements.txt
└── README.md
```

## Architecture

| Component | Details |
|-----------|---------|
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |
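
As a rough sanity check on the parameter budget, the sketch below counts the weights of the Transformer layers from the table values alone. It assumes the standard PyTorch `nn.TransformerEncoderLayer` / `nn.TransformerDecoderLayer` parameter layout (biases and layer norms included); the layers actually built in `Scripts/train.py` may differ, and all helper names are illustrative.

```python
# Assumption: every layer follows the standard PyTorch Transformer layout.
D_MODEL = 256     # hidden size (from the table)
FF_DIM = 1024     # feed-forward dimension (from the table)
NUM_LAYERS = 3    # encoder layers = decoder layers (from the table)


def attention_params(d: int) -> int:
    """Q/K/V projections plus the output projection, with biases."""
    return (3 * d * d + 3 * d) + (d * d + d)


def feed_forward_params(d: int, ff: int) -> int:
    """Two linear layers (d -> ff -> d), with biases."""
    return (d * ff + ff) + (ff * d + d)


def layer_norm_params(d: int) -> int:
    """One scale vector and one shift vector."""
    return 2 * d


def encoder_layer_params(d: int, ff: int) -> int:
    # self-attention + feed-forward + 2 layer norms
    return attention_params(d) + feed_forward_params(d, ff) + 2 * layer_norm_params(d)


def decoder_layer_params(d: int, ff: int) -> int:
    # self-attention + cross-attention + feed-forward + 3 layer norms
    return 2 * attention_params(d) + feed_forward_params(d, ff) + 3 * layer_norm_params(d)


transformer_total = NUM_LAYERS * (
    encoder_layer_params(D_MODEL, FF_DIM) + decoder_layer_params(D_MODEL, FF_DIM)
)
print(f"Transformer layers: {transformer_total:,} parameters")
```

Under these assumptions the six Transformer layers hold about 5.5M parameters, so the remainder of the ~12.8M total would sit in the DenseNet-121 feature extractor, the embeddings, and the projection layers.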

## Performance

### Kurdish (DASTNUS)

| Configuration | CER | WER | CRR (%) |
|--------------|-----|-----|---------|
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

### Cross-Dataset Generalization

| Dataset | Language | CER | WER | CRR (%) |
|---------|----------|-----|-----|---------|
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
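
The tables report CER, WER, and CRR without defining them. The sketch below shows the usual definitions, assuming CER and WER are the Levenshtein edit distance divided by the reference length (in characters and in whitespace-separated words, respectively) and that CRR = 100 * (1 - CER), which is consistent with the reported values (for example 0.0593 and 94.07). The function names are illustrative and are not taken from the repository's scripts.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def crr(cer_value: float) -> float:
    """Character recognition rate in percent, assumed to be 100 * (1 - CER)."""
    return 100.0 * (1.0 - cer_value)


if __name__ == "__main__":
    reference = "hello world"
    prediction = "hallo world"
    print(f"CER={cer(reference, prediction):.4f}  WER={wer(reference, prediction):.4f}")
```

These functions score a single line. Dataset-level figures are often obtained by summing edit distances and reference lengths over all lines before dividing; which aggregation the reported numbers use is not stated here.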

## Installation

```bash
git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt
```

## Quick Start

### Inference

```bash
# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json
```

### Training

```bash
# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50
```

### Synthetic Line Generation

```bash
python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
```

## Models

| Model | Language | Vocabulary | Format |
|-------|----------|-----------|--------|
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

The Arabic and Urdu models share a single unified vocabulary that covers all three languages (Kurdish + Arabic + Urdu), which enables cross-script transfer learning from the Kurdish model.
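
One way such a unified vocabulary could be assembled is sketched below: the Kurdish tokens keep their original indices, and tokens that occur only in the Arabic or Urdu vocabularies are appended afterwards, so that weights tied to Kurdish token indices stay aligned during fine-tuning. This is an assumption about the procedure, not the repository's documented method. The function name, the special tokens, and the toy token lists are hypothetical, and the real `vocab.json` layout may differ.

```python
from typing import Dict, Iterable, List


def build_unified_vocab(base_tokens: List[str],
                        *extra_token_lists: Iterable[str]) -> Dict[str, int]:
    """Merge token lists into one token -> index mapping.

    Tokens from `base_tokens` keep their positions; tokens seen only in
    later lists are appended in first-seen order.
    """
    vocab: Dict[str, int] = {}
    for token in base_tokens:
        vocab.setdefault(token, len(vocab))
    for tokens in extra_token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


# Toy token lists (hypothetical, far smaller than the real vocabularies).
kurdish = ["<pad>", "<sos>", "<eos>", "ا", "ب", "ڕ", "ێ"]
arabic = ["<pad>", "<sos>", "<eos>", "ا", "ب", "ص", "ض"]
urdu = ["<pad>", "<sos>", "<eos>", "ا", "ب", "ٹ", "ے"]

unified = build_unified_vocab(kurdish, arabic, urdu)
print(len(unified), "tokens in the toy unified vocabulary")
```

Keeping the base indices fixed is only one possible design; sorting all tokens into a fresh order would also give a valid vocabulary, but the Kurdish embedding and output rows would then need to be remapped before fine-tuning.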

## Dataset

The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:

| Data Source | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| **Total** | **7,849** | **655** | **649** |

The data used in this research is available upon request for non-commercial scientific research purposes only.

## Citation

```bibtex

[]

```

## License

This repository is released under the CC BY-NC 4.0 license, for **non-commercial scientific research purposes only**.