# BertForTokenClassificationWithFourO

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/USERNAME/MODEL_NAME)

A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

## Model Description

This model uses a BERT architecture with a custom token classification head called FourOClassifier. It is designed to insert or correct spacing characters in Persian text: regular spaces and zero-width non-joiners (ZWNJ).

### Task

The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
- **Spacing Mode**: Uses pure model predictions to insert spaces
- **Correction Mode**: Combines model predictions with existing spacing in the text
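
One way to picture the difference between the two modes, using the label scheme from the Label Meaning section (0 = no spacing, 1 = space, 2 = ZWNJ): spacing mode uses the predicted labels as they are, while correction mode also consults the labels implied by the input text. The merge rule below (keep any spacing already present, otherwise take the prediction) is an assumption made for illustration, and `merge_labels` is a hypothetical name. The repository's actual rule may differ.

```python
def merge_labels(predicted, existing):
    """Hypothetical correction-mode merge: trust spacing already in the text.

    Both arguments are per-character label lists (0 = none, 1 = space, 2 = ZWNJ).
    This rule is an assumption, not the repository's confirmed behavior.
    """
    if len(predicted) != len(existing):
        raise ValueError("label sequences must have the same length")
    return [old if old != 0 else new for new, old in zip(predicted, existing)]


predicted = [0, 1, 0, 2, 0]  # what the model proposes
existing = [0, 0, 1, 0, 0]   # spacing found in the input text

spacing_mode = predicted                             # model predictions only
correction_mode = merge_labels(predicted, existing)  # predictions plus input spacing
print(correction_mode)  # [0, 1, 1, 2, 0]
```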

### Model Architecture

The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:
- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
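
The components listed above could be wired together roughly as follows. This is a minimal PyTorch sketch, not the code in `modeling_custom`: the class name `FourOClassifierSketch`, the layer order, the sizes (`hidden_size=768`, `inner_size=256`), the dropout rate, and `num_labels=3` (taken from the three labels in the evaluation tables) are all assumptions.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical head: dense + ReLU, dropout, batch norm, output projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects features on axis 1 of a 2-D input,
        # so flatten all tokens into rows before normalizing.
        x = self.norm(x.reshape(batch * seq_len, -1))
        # One logit per label for every token.
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch().eval()    # eval() disables dropout
logits = head(torch.randn(2, 16, 768))   # stand-in for encoder output
print(tuple(logits.shape))  # (2, 16, 3)
```

Flattening the batch and sequence axes is one simple way to apply batch normalization per token; transposing to `(batch, features, seq_len)` would be an equally valid choice.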

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
import torch

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

# Initialize labeler: tag 1 = regular space, tag 2 = ZWNJ (U+200C)
labeler = Labeler(tags=(1, 2),
                 regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                 chars=(" ", "‌"),
                 class_count=2)

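# Process text with ModelPipeline, which is defined in run.py in this repository
from run import ModelPipeline

pipeline = ModelPipeline(model_path)  # build once and reuse across calls

def process_text(text, mode="space"):
    """Return the re-spaced text; mode is "space" or "correct"."""
    return pipeline.process_text(text, mode)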

# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```

### Command-line Usage

You can also use the provided command-line interface:

```bash
python run.py --text "متن فارسی شما در اینجا" --mode space
```

Or process a file:

```bash
python run.py --file input.txt --output result.txt --mode correct
```

The repository includes a sample `input.txt` file that you can use to test the model.

## Parameters

- `mode`: 
  - `space`: Uses model predictions to add spaces
  - `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)

## Evaluation

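The tables below report per-label precision, recall, accuracy, and F1 score for both operating modes. The averaged F1 score is 0.969 in spacing mode and 0.981 in correction mode.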

### Spacing Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

### Correction Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

Correction mode scores slightly higher on every label by combining model predictions with the spacing already present in the text. The gain is largest for label 2 (ZWNJ), where the F1 score rises from 0.923 to 0.952.
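
For readers who want to reproduce figures of this kind on their own data, the sketch below computes per-label precision, recall, and F1 from two label sequences, plus an unweighted (macro) average. The Average rows in the tables above are consistent with such a macro mean of the three label rows. The function name `per_label_metrics` and the toy sequences are illustrative and are not taken from the repository's evaluation code.

```python
def per_label_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Return precision, recall and F1 per label, plus their macro average."""
    pairs = list(zip(y_true, y_pred))
    rows = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: plain mean over labels, ignoring label frequency.
    rows["average"] = {
        key: sum(rows[label][key] for label in labels) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    return rows


# Toy sequences, not real model output.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
metrics = per_label_metrics(y_true, y_pred)
print(metrics[1]["f1"])  # close to 0.8

# Consistency check against the spacing-mode table: mean of the three F1 rows.
spacing_f1 = [0.995992, 0.988686, 0.922674]
macro_f1 = round(sum(spacing_f1) / 3, 6)
print(macro_f1)  # 0.969117
```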

### Label Meaning
- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: Zero-width non-joiner (ZWNJ, U+200C) needed
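
To make the label scheme concrete, the sketch below converts text into characters plus labels and back again. It assumes each label describes the spacing character that follows a given character; the repository's `Labeler` may attach labels differently, and the helper names `text_to_labels` and `labels_to_text` are hypothetical.

```python
SPACE, ZWNJ = " ", "\u200c"
LABEL_TO_CHAR = {0: "", 1: SPACE, 2: ZWNJ}
CHAR_TO_LABEL = {SPACE: 1, ZWNJ: 2}


def text_to_labels(text):
    """Strip spacing characters and tag each remaining character
    with the spacing that followed it (assumed convention)."""
    chars, labels = [], []
    for ch in text:
        if ch in CHAR_TO_LABEL:
            if labels:  # leading spacing has no character to attach to
                labels[-1] = CHAR_TO_LABEL[ch]
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def labels_to_text(chars, labels):
    """Rebuild text by appending the spacing character each label calls for."""
    return "".join(ch + LABEL_TO_CHAR[label] for ch, label in zip(chars, labels))


text = "می\u200cروم خانه"  # contains one ZWNJ and one regular space
chars, labels = text_to_labels(text)
print(labels)  # [0, 2, 0, 0, 1, 0, 0, 0, 0]
restored = labels_to_text(chars, labels)
print(restored == text)  # True
```

At inference time, the model would supply the label list and `labels_to_text` would produce the spaced output.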

## Use Cases

This model is particularly useful for:
- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis

## Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:
- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]

## Limitations

- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- The code is tuned for execution on CUDA devices
- [ANY OTHER LIMITATIONS]
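
The chunking suggested above for very long texts could look like the following sketch. It splits at whitespace where possible and processes each chunk independently. The `max_chars=400` default is an arbitrary placeholder (the right limit depends on the tokenizer and the model's maximum sequence length), and `process_fn` stands in for whatever callable does the spacing, such as the `process_text` function from Basic Usage.

```python
def split_into_chunks(text, max_chars=400):
    """Split text into pieces of at most max_chars characters,
    breaking at whitespace when a break point is available."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last space or newline inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left chunk
        chunks.append(text[start:end])
        start = end
    return chunks


def process_long_text(text, process_fn, max_chars=400):
    """Apply process_fn to each chunk and concatenate the results."""
    return "".join(process_fn(chunk) for chunk in split_into_chunks(text, max_chars))


sample = "word " * 300                      # 1500 characters of dummy input
pieces = split_into_chunks(sample)
print(all(len(p) <= 400 for p in pieces))   # True
print("".join(pieces) == sample)            # True
```

Because each chunk is processed without its neighbors, predictions near chunk boundaries see less context. Text that contains no whitespace at all falls back to hard cuts every `max_chars` characters.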

## Citation

```
[CITATION_INFO]
```

## License

[LICENSE_INFO]

## Contact

For questions or feedback, please contact [CONTACT_INFO].