# BertForTokenClassificationWithFourO

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/USERNAME/MODEL_NAME)

A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

## Model Description

This model pairs a BERT encoder with a custom token classification head called FourOClassifier. It is designed to process Persian text and insert or correct spacing characters: regular spaces and zero-width non-joiners (ZWNJ).

### Task

The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
- **Spacing Mode**: Uses pure model predictions to insert spaces
- **Correction Mode**: Combines model predictions with existing spacing in the text
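
The difference between the two modes can be sketched with per-character labels (0 = no spacing, 1 = space, 2 = ZWNJ, as listed under Label Meaning). The merge rule below is only an assumption for illustration: it keeps spacing that is already present in the input and falls back to the model prediction elsewhere. The name `merge_labels` is hypothetical, and the repository's actual rule may differ.

```python
from typing import List


def merge_labels(predicted: List[int], existing: List[int]) -> List[int]:
    """Hypothetical merge rule: keep spacing already in the text, else use the model."""
    if len(predicted) != len(existing):
        raise ValueError("label sequences must have the same length")
    # A non-zero existing label means the input already had a space or ZWNJ there.
    return [e if e != 0 else p for p, e in zip(predicted, existing)]


predicted = [0, 1, 0, 2, 0]   # labels from the model
existing = [0, 0, 0, 1, 0]    # labels read off the input text

spacing_mode = predicted                              # model output only
correction_mode = merge_labels(predicted, existing)   # model + existing spacing
print(correction_mode)  # [0, 1, 0, 1, 0]
```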

### Model Architecture

The FourOClassifier head sits on top of the BERT encoder and includes:
- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
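
A minimal PyTorch sketch of a head with these four components follows. It is not the repository's `FourOClassifier`: the class name `FourOClassifierSketch`, the layer order, the sizes (768, 256), the dropout rate, and the three output labels are all assumptions made for illustration.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Illustrative head: dense + ReLU, dropout, batch norm, output projection."""

    def __init__(self, hidden_size: int = 768, inner_size: int = 256,
                 num_labels: int = 3, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d normalizes the feature axis of a 2-D input, so the
        # tokens are flattened to (batch * seq_len, inner_size) first.
        x = self.norm(x.reshape(batch * seq_len, -1))
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch().eval()
with torch.no_grad():
    # Stand-in for BERT's last hidden state: 2 sequences of 5 tokens.
    logits = head(torch.randn(2, 5, 768))
print(logits.shape)  # torch.Size([2, 5, 3])
```

Flattening the token axis before batch normalization is one design choice among several; transposing to `(batch, features, seq_len)` would also satisfy `BatchNorm1d`.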

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
from run import ModelPipeline

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

# Initialize labeler: tag 1 = regular space, tag 2 = ZWNJ (U+200C)
labeler = Labeler(tags=(1, 2),
                  regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                  chars=(" ", "\u200c"),
                  class_count=2)

# Build the pipeline once and reuse it across calls
pipeline = ModelPipeline(model_path)

def process_text(text, mode="space"):
    """Run the pipeline in "space" or "correct" mode."""
    return pipeline.process_text(text, mode)

# Example ("This is a sample Persian text without proper spacing")
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```
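
The `Labeler` arguments above pair tag 1 with a horizontal-whitespace regex and tag 2 with ZWNJ. As a rough sketch of what such a labeling step might do, the function below strips spacing characters from a string and records, for each remaining character, which spacing character followed it. `text_to_labels` is a hypothetical helper, not the repository's `Labeler`, and attaching the label to the preceding character is an assumption.

```python
import re
from typing import List, Tuple

# Horizontal whitespace, same pattern as in the Labeler call above.
SPACE_RE = re.compile(r"[^\S\r\n\v\f]")
ZWNJ = "\u200c"


def text_to_labels(text: str) -> Tuple[List[str], List[int]]:
    """Return the non-spacing characters and one label (0, 1, or 2) per character."""
    chars: List[str] = []
    labels: List[int] = []
    for ch in text:
        if ch == ZWNJ or SPACE_RE.fullmatch(ch):
            # Spacing before the first character has nothing to attach to.
            if labels:
                labels[-1] = 2 if ch == ZWNJ else 1
            continue
        chars.append(ch)
        labels.append(0)
    return chars, labels


chars, labels = text_to_labels("ab c\u200cd")
print(chars)   # ['a', 'b', 'c', 'd']
print(labels)  # [0, 1, 2, 0]
```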

### Command-line Usage

You can also use the provided command-line interface:

```bash
python run.py --text "متن فارسی شما در اینجا" --mode space
```

Or process a file:

```bash
python run.py --file input.txt --output result.txt --mode correct
```

The repository includes a sample `input.txt` file that you can use to test the model.

## Parameters

- `mode`: 
  - `space`: Uses model predictions to add spaces
  - `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)

## Evaluation

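The tables below report per-label precision, recall, accuracy, and F1 score for both operating modes. The Average row is the unweighted (macro) mean over the three labels.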

### Spacing Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

### Correction Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

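Correction mode scores slightly higher because it combines model predictions with the spacing already present in the text: the average F1 rises from 0.969117 to 0.980564, and the largest gain is on label 2 (ZWNJ), where F1 rises from 0.922674 to 0.951962.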
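
As a quick consistency check, the F1 column and the Average row can be recomputed from the precision and recall values. The sketch below does this for the spacing-mode table, assuming the standard definition F1 = 2PR / (P + R) and an unweighted mean across labels; the variable names are illustrative.

```python
# Precision and recall per label, copied from the spacing-mode table.
spacing_mode = {
    0: (0.994663, 0.997324),
    1: (0.989546, 0.987828),
    2: (0.913413, 0.932125),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_label_f1 = {label: f1(p, r) for label, (p, r) in spacing_mode.items()}

# Unweighted (macro) averages over the three labels.
n = len(spacing_mode)
macro_precision = sum(p for p, _ in spacing_mode.values()) / n
macro_recall = sum(r for _, r in spacing_mode.values()) / n
macro_f1 = sum(per_label_f1.values()) / n

for label, value in per_label_f1.items():
    print(f"label {label}: F1 = {value:.6f}")
print(f"macro: P = {macro_precision:.6f}, R = {macro_recall:.6f}, F1 = {macro_f1:.6f}")
```

The recomputed values should agree with the table up to rounding in the last printed digit.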

### Label Meaning
- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: ZWNJ character (zero-width non-joiner, U+200C) needed
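
To illustrate how these labels turn back into text, here is a small sketch that appends the matching spacing character after each input character. It assumes one label per character and that the spacing character goes after the labeled character; `apply_labels` and `SPACING_CHARS` are hypothetical names, not part of the repository.

```python
from typing import Sequence

# Label -> character to append after the labeled character.
SPACING_CHARS = {0: "", 1: " ", 2: "\u200c"}


def apply_labels(chars: Sequence[str], labels: Sequence[int]) -> str:
    """Rebuild spaced text from characters and their predicted labels."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(ch + SPACING_CHARS[label] for ch, label in zip(chars, labels))


result = apply_labels("abcd", [0, 1, 2, 0])
print(result == "ab c\u200cd")  # True
```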

## Use Cases

This model is particularly useful for:
- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis

## Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:
- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]

## Limitations

- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on devices with CUDA
- [ANY OTHER LIMITATIONS]
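
For the chunking point above, one simple approach is to split on line boundaries and hard-cut any line that is still too long. This is only a sketch: `split_into_chunks`, `process_long_text`, and the 400-character default are assumptions, and `process` stands for any callable that maps a string to a string (for example a wrapper around the pipeline's `process_text`).

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Split text at line boundaries, hard-cutting lines longer than max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    for line in text.splitlines(keepends=True):
        for start in range(0, len(line), max_chars):
            chunks.append(line[start:start + max_chars])
    return chunks


def process_long_text(text: str, process: Callable[[str], str],
                      max_chars: int = 400) -> str:
    """Apply `process` to each chunk and concatenate the results in order."""
    return "".join(process(chunk) for chunk in split_into_chunks(text, max_chars))


# str.upper stands in for the model call so the example runs without the model.
print(split_into_chunks("abc\ndefgh", max_chars=3))  # ['abc', '\n', 'def', 'gh']
print(repr(process_long_text("abc\ndefgh", str.upper, max_chars=3)))  # 'ABC\nDEFGH'
```

Hard cuts inside a line may reduce prediction quality near the cut, so a chunk size that rarely splits a line is preferable.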

## Citation

```
[CITATION_INFO]
```

## License

[LICENSE_INFO]

## Contact

For questions or feedback, please contact [CONTACT_INFO].