
BertForTokenClassificationWithFourO


A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

Model Description

This model is built on a BERT architecture with a custom token classification head called FourOClassifier. It is designed to insert or correct spacing characters (regular spaces and zero-width non-joiners) in Persian text.

Task

The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:

  • Spacing Mode: Uses pure model predictions to insert spaces
  • Correction Mode: Combines model predictions with existing spacing in the text
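
As a rough illustration of how the two modes differ, the sketch below merges per-character labels. It assumes, hypothetically, that correction mode keeps a spacing character already present in the input and falls back to the model's prediction everywhere else; the repository's actual merging rule may differ. The function name `merge_labels` is illustrative, and the label encoding follows the Label Meaning section (0 = none, 1 = space, 2 = ZWNJ).

```python
def merge_labels(predicted, existing, mode="space"):
    """Return the final per-character labels for the given mode.

    predicted: labels from the model (0 = none, 1 = space, 2 = ZWNJ)
    existing:  labels derived from spacing already present in the input
    """
    if mode == "space":
        # Spacing mode: trust the model alone.
        return list(predicted)
    if mode == "correct":
        # Correction mode (assumed rule): keep existing spacing,
        # otherwise use the model's prediction.
        return [e if e != 0 else p for p, e in zip(predicted, existing)]
    raise ValueError(f"unknown mode: {mode!r}")


predicted = [0, 1, 0, 2, 0]
existing = [0, 0, 0, 1, 0]
print(merge_labels(predicted, existing, "space"))    # [0, 1, 0, 2, 0]
print(merge_labels(predicted, existing, "correct"))  # [0, 1, 0, 1, 0]
```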

Model Architecture

The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:

  • Dense layer with ReLU activation
  • Dropout for regularization
  • Batch normalization
  • Output projection layer
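
A minimal PyTorch sketch of a head with these four components is shown below. It is not the repository's FourOClassifier: the layer order simply follows the list above, and the class name, the sizes (768, 256), the dropout rate, and the choice to flatten tokens before batch normalization are all assumptions made for illustration.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical token classification head: dense + ReLU, dropout, batch norm, projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects (N, C), so treat every token as one sample.
        x = self.norm(x.reshape(batch * seq_len, -1))
        # One logit per label for every token.
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch().eval()
with torch.no_grad():
    logits = head(torch.randn(2, 5, 768))
print(logits.shape)  # torch.Size([2, 5, 3])
```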

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
import torch

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

# Initialize labeler
labeler = Labeler(tags=(1, 2),
                 regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                 chars=(" ", "\u200c"),
                 class_count=2)

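# Process text with the repository's pipeline helper (run.py)
from run import ModelPipeline

# Build the pipeline once and reuse it across calls
pipeline = ModelPipeline(model_path)

def process_text(text, mode="space"):
    # mode is "space" (model predictions only) or "correct" (merge with existing spacing)
    return pipeline.process_text(text, mode)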

# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"  # "This is a sample Persian text without proper spacing"
result = process_text(text, mode="space")
print(result)

Command-line Usage

You can also use the provided command-line interface:

python run.py --text "متن فارسی شما در اینجا" --mode space

Or process a file:

python run.py --file input.txt --output result.txt --mode correct

The repository includes a sample input.txt file that you can use to test the model.

Parameters

  • mode:
    • space: Uses model predictions to add spaces
    • correct: Combines model predictions with original text spacing (recommended for texts with some correct spacing)

Evaluation

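The tables below report per-label precision, recall, accuracy, and F1 score for both operating modes. The average F1 score is 0.969 in spacing mode and 0.981 in correction mode.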

Spacing Mode Evaluation

╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛

Correction Mode Evaluation

╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛

Correction mode scores slightly higher on every label because it combines model predictions with the spacing already present in the text. The largest gain is on label 2, where the F1 score rises from 0.923 to 0.952.
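
For readers who want to reproduce this kind of table, the sketch below computes per-label precision, recall, and F1 from two label sequences and then takes their unweighted mean. The function names and the toy label lists are illustrative and not taken from the repository's evaluation script. The last lines check that the Average row of the spacing-mode table equals the unweighted mean of its three label rows. The Accuracy column equals Recall in both tables, so it is not computed separately here.

```python
def per_label_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Compute precision, recall and F1 for each label from paired label lists."""
    rows = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows[label] = {"precision": precision, "recall": recall, "f1": f1}
    return rows


def macro_average(rows, key):
    """Unweighted mean of one metric across all labels."""
    return sum(r[key] for r in rows.values()) / len(rows)


# Toy data, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
y_pred = [0, 0, 1, 0, 2, 1, 0, 1]
rows = per_label_metrics(y_true, y_pred)
print(rows[0]["precision"], rows[0]["recall"])  # 0.75 1.0

# Consistency check against the spacing-mode table above.
reported_precision = [0.994663, 0.989546, 0.913413]
print(round(sum(reported_precision) / 3, 6))  # 0.965874
```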

Label Meaning

  • Label 0: No spacing needed
  • Label 1: Regular space character needed
  • Label 2: ZWNJ character (zero-width non-joiner, U+200C) needed
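
The sketch below shows one way these labels could be turned back into text. It assumes each label applies to the position immediately after its character; the repository's Labeler may align labels differently. The names `apply_labels` and `LABEL_TO_CHAR` are illustrative.

```python
SPACE = " "
ZWNJ = "\u200c"  # zero-width non-joiner
LABEL_TO_CHAR = {0: "", 1: SPACE, 2: ZWNJ}


def apply_labels(chars, labels):
    """Insert the spacing character selected by each label after its character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(c + LABEL_TO_CHAR[l] for c, l in zip(chars, labels))


# Unspaced input; ZWNJ goes after the 2nd character, a space after the 5th.
chars = "میرومخانه"
labels = [0, 2, 0, 0, 1, 0, 0, 0, 0]
result = apply_labels(chars, labels)
print(result == "می\u200cروم خانه")  # True
```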

Use Cases

This model is particularly useful for:

  • Correcting Persian text with improper spacing
  • Normalizing text from different sources
  • Improving text readability for downstream NLP tasks
  • Preprocessing Persian text for search engines or text analysis

Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:

  • Learning rate: [VALUE]
  • Batch size: [VALUE]
  • Training steps: [VALUE]
  • [OTHER PARAMETERS]

Limitations

  • The model is specifically designed for Persian text
  • Performance may vary on specialized domains or technical texts
  • Very long texts should be processed in chunks for optimal performance
  • Tuned for execution on devices with CUDA
  • [ANY OTHER LIMITATIONS]
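
The chunking recommendation above can be sketched as follows. This is a simple greedy splitter, not code from the repository: the name `chunk_text` and the 512-character default are assumptions, and the right limit depends on the tokenizer's maximum sequence length. Whitespace runs, including line breaks, are collapsed to single spaces, which may matter in correction mode.

```python
def chunk_text(text, max_chars=512):
    """Split text into chunks of at most max_chars, breaking at whitespace when possible."""
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single run that is longer than the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("aa bb cc dd", max_chars=5))  # ['aa bb', 'cc dd']
print(chunk_text("abcdefg", max_chars=3))      # ['abc', 'def', 'g']
```

Each chunk could then be passed to `process_text` from the Basic Usage section and the results joined back together.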

Citation

[CITATION_INFO]

License

[LICENSE_INFO]

Contact

For questions or feedback, please contact [CONTACT_INFO].
