BertForTokenClassificationWithFourO

A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

Model Description

The model pairs a BERT encoder with a custom token classification head called FourOClassifier. It is designed to correct or insert spacing characters in Persian text.

Task

The model performs token classification to detect where a spacing character (a regular space or a zero-width non-joiner, ZWNJ) should be inserted in Persian text. It operates in one of two modes:

Spacing Mode: Uses pure model predictions to insert spaces
Correction Mode: Combines model predictions with existing spacing in the text

Model Architecture

The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:

Dense layer with ReLU activation
Dropout for regularization
Batch normalization
Output projection layer
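
The layer list above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the repository's FourOClassifier: the class name FourOClassifierSketch, the layer order, the inner width of 256, the dropout rate, and the 768-dimensional input (BERT-base) are all assumptions. The three output labels follow the Label Meaning section.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical head: dense + ReLU, dropout, batch norm, output projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d normalizes the feature axis of a 2-D input, so flatten
        # all tokens into one axis and restore the shape afterwards.
        x = self.norm(x.reshape(batch * seq_len, -1))
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch().eval()  # eval() disables dropout
logits = head(torch.randn(2, 16, 768))  # one score per label for every token
```

Taking the argmax over the last axis of logits would give one predicted label per token.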

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
import torch

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

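# Initialize labeler: tag 1 marks a regular space, tag 2 a ZWNJ (U+200C)
labeler = Labeler(tags=(1, 2),
                  regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                  chars=(" ", "\u200c"),  # escaped so the invisible ZWNJ is visible
                  class_count=2)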

# Process text through the repository's ModelPipeline (defined in run.py)
from run import ModelPipeline

pipeline = ModelPipeline(model_path)  # build once and reuse across calls

def process_text(text, mode="space"):
    # mode is "space" or "correct"; see Parameters below
    return pipeline.process_text(text, mode)

# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
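
The Labeler arguments above pair tag 1 with a whitespace pattern and tag 2 with the ZWNJ pattern. As an illustration of what such a labeling scheme might look like, the sketch below strips spacing characters from correctly spaced text and records, for each remaining character, which spacing character followed it. The function text_to_labels is hypothetical and is not the repository's Labeler; labeling at the character level is an assumption.

```python
import re

# Same patterns as in the Labeler call: horizontal whitespace and ZWNJ.
SPACE_RE = re.compile(r"[^\S\r\n\v\f]")
ZWNJ_RE = re.compile(r"\u200c")


def text_to_labels(text):
    """Split spaced text into bare characters plus one label per character."""
    chars, labels = [], []
    for ch in text:
        if SPACE_RE.fullmatch(ch):
            tag = 1
        elif ZWNJ_RE.fullmatch(ch):
            tag = 2
        else:
            # Ordinary character: keep it, with "no spacing" as the default.
            chars.append(ch)
            labels.append(0)
            continue
        if labels:  # a leading spacing character has nothing to attach to
            labels[-1] = tag
    return chars, labels


chars, labels = text_to_labels("ab c\u200cd")
```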

Command-line Usage

You can also use the provided command-line interface:

python run.py --text "متن فارسی شما در اینجا" --mode space

Or process a file:

python run.py --file input.txt --output result.txt --mode correct

The repository includes a sample input.txt file that you can use to test the model.

Parameters

mode:
- space: Uses model predictions to add spaces
- correct: Combines model predictions with original text spacing (recommended for texts with some correct spacing)
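
How the correct mode combines the two sources is not specified here. One plausible rule, shown purely as an assumption, is to keep any spacing character already present in the input and fall back to the model's prediction everywhere else. The helper merge_labels and the per-position label lists are hypothetical; labels follow the scheme 0 = none, 1 = space, 2 = ZWNJ.

```python
def merge_labels(predicted, existing):
    """Keep spacing already present in the text; otherwise trust the model."""
    if len(predicted) != len(existing):
        raise ValueError("label sequences must have the same length")
    return [old if old != 0 else new for new, old in zip(predicted, existing)]


predicted = [0, 1, 2, 0]  # labels from the model
existing = [1, 0, 0, 0]   # labels read off the input text
merged = merge_labels(predicted, existing)
```

Under this rule the space mode corresponds to using predicted unchanged.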

Evaluation

The model reaches an average F1 score of about 0.97 in spacing mode and 0.98 in correction mode:

Spacing Mode Evaluation

╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛

Correction Mode Evaluation

╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛

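Correction mode scores slightly higher because it combines model predictions with the spacing already present in the text: average F1 rises from 0.969 to 0.981, and the largest gain is on label 2, where F1 rises from 0.923 to 0.952.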
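
The tables are consistent with the standard definitions: each F1 value equals 2PR/(P+R) for its row, and the Average row is the unweighted mean of the three label rows. The Accuracy column matches Recall in every row, which suggests it was computed per label as the fraction of true instances predicted correctly. A minimal sketch of these per-label metrics follows; the function per_label_scores and the toy label sequences are illustrative and are not taken from the repository's evaluation code.

```python
def per_label_scores(y_true, y_pred, labels=(0, 1, 2)):
    """Return precision, recall, and F1 for each label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy data, not taken from the model's evaluation set.
y_true = [0, 0, 1, 2, 1, 0]
y_pred = [0, 1, 1, 2, 0, 0]
scores = per_label_scores(y_true, y_pred)

# Unweighted (macro) average, as in the Average row of the tables.
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
```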

Label Meaning

Label 0: No spacing needed
Label 1: Regular space character needed
Label 2: Zero-width non-joiner (ZWNJ, U+200C) needed
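
To make the label scheme concrete, the sketch below rebuilds text from a sequence of characters and their labels. It assumes one label per character, with the indicated spacing character inserted after that character; the repository may instead label at the token level. The names LABEL_TO_CHAR and apply_labels are hypothetical.

```python
# Assumed mapping from the label IDs above to the character they insert.
LABEL_TO_CHAR = {0: "", 1: " ", 2: "\u200c"}


def apply_labels(chars, labels):
    """Rebuild text by appending each label's spacing character after its unit."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(ch + LABEL_TO_CHAR[label] for ch, label in zip(chars, labels))


# Unspaced input; the labels place a ZWNJ after the second character
# and a regular space after the fifth.
chars = list("میرومخانه")
labels = [0, 2, 0, 0, 1, 0, 0, 0, 0]
restored = apply_labels(chars, labels)
print(restored)
```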

Use Cases

This model is particularly useful for:

Correcting Persian text with improper spacing
Normalizing text from different sources
Improving text readability for downstream NLP tasks
Preprocessing Persian text for search engines or text analysis

Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:

Learning rate: [VALUE]
Batch size: [VALUE]
Training steps: [VALUE]
[OTHER PARAMETERS]

Limitations

The model is specifically designed for Persian text
Performance may vary on specialized domains or technical texts
Very long texts should be processed in chunks for optimal performance
Tuned for execution on CUDA-capable devices
[ANY OTHER LIMITATIONS]
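
For the chunking advice above, one simple approach is to split on line breaks and hard-split any line that is still too long. The sketch below is an assumption about how chunking could be done, not part of the repository; the function chunk_text and the 400-character limit are hypothetical and should be tuned to the model's maximum sequence length.

```python
def chunk_text(text, max_chars=400):
    """Split text into pieces of at most max_chars, preferring line breaks."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Hard-split any single line that is itself longer than the limit.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            if len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


long_text = "a" * 1000 + "\n" + "b" * 10
chunks = chunk_text(long_text, max_chars=400)
```

Each chunk could then be passed to process_text and the results concatenated. Spacing decisions right at a chunk boundary lose their surrounding context, so breaking at line ends where possible is the safer choice.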

Citation

[CITATION_INFO]

License

[LICENSE_INFO]

Contact

For questions or feedback, please contact [CONTACT_INFO].

Model Details

Format: Safetensors
Model size: 0.2B params
Tensor type: F32
