BertForTokenClassificationWithFourO
A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.
Model Description
This model is built on a BERT architecture with a custom token classification head called FourOClassifier. It's specifically designed for processing Persian text to correct or add proper spacing characters.
Task
The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
- Spacing Mode: Uses pure model predictions to insert spaces
- Correction Mode: Combines model predictions with existing spacing in the text
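The difference between the two modes can be sketched in a few lines. This is only an illustration, not the repository's code: the function name `combine_labels` and the rule that spacing already present in the input takes precedence over the prediction are assumptions. The label values 0, 1 and 2 are the ones listed under Label Meaning.

```python
def combine_labels(predicted, existing):
    """Merge per-character labels (hypothetical correction-mode rule).

    0 = no spacing, 1 = space, 2 = ZWNJ. Where the input text already has a
    spacing character (existing label != 0) it is kept; otherwise the model's
    prediction is used.
    """
    if len(predicted) != len(existing):
        raise ValueError("label sequences must have the same length")
    return [e if e != 0 else p for p, e in zip(predicted, existing)]


predicted = [0, 1, 0, 2, 0]   # labels from the model
existing = [0, 0, 0, 1, 0]    # labels read off the input text

spacing_mode = predicted                                # spacing mode: predictions only
correction_mode = combine_labels(predicted, existing)  # correction mode: merged labels
print(correction_mode)
```

The actual merge rule used by the pipeline may differ, for example by trusting the model when it is highly confident.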
Model Architecture
The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:
- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
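The components listed above could be assembled roughly as follows. This is a sketch under stated assumptions, not the actual FourOClassifier: the class name `FourOClassifierSketch`, the layer order, the sizes (768, 256), the dropout rate and the three output labels are all illustrative guesses.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical token classification head: dense + ReLU, dropout, batch norm, projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects (N, features), so flatten the token dimension first
        x = self.norm(x.reshape(batch * seq_len, -1))
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch(hidden_size=16, inner_size=8)
head.eval()  # disable dropout and use running batch-norm statistics
logits = head(torch.randn(2, 5, 16))
print(logits.shape)
```

The output holds one logit per label for every token, which is the shape a token classification loss expects.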
Usage
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
import torch
# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()
# Initialize labeler
labeler = Labeler(tags=(1, 2),
                  regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                  chars=(" ", "\u200c"),
                  class_count=2)
# Process text
# ModelPipeline is defined in run.py in this repository
from run import ModelPipeline
# Build the pipeline once and reuse it across calls
pipeline = ModelPipeline(model_path)
def process_text(text, mode="space"):
    # mode is "space" or "correct"
    return pipeline.process_text(text, mode)
# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
Command-line Usage
You can also use the provided command-line interface:
python run.py --text "متن فارسی شما در اینجا" --mode space
Or process a file:
python run.py --file input.txt --output result.txt --mode correct
The repository includes a sample input.txt file that you can use to test the model.
Parameters
mode:
- space: Uses model predictions to add spaces
- correct: Combines model predictions with original text spacing (recommended for texts with some correct spacing)
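For readers who want to wrap the model in their own script, the flags shown above map naturally onto `argparse`. The parser below is a hypothetical sketch and not the contents of run.py: treating `--text` and `--file` as mutually exclusive, making `--output` optional and defaulting `--mode` to `space` are all assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Persian spacing correction (sketch)")
    # Assumption: exactly one input source is given
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="text to process")
    source.add_argument("--file", help="path to an input text file")
    # Assumption: without --output the result would go to stdout
    parser.add_argument("--output", help="path to write the result to")
    parser.add_argument("--mode", choices=("space", "correct"), default="space")
    return parser


args = build_parser().parse_args(
    ["--file", "input.txt", "--output", "result.txt", "--mode", "correct"]
)
print(args.mode, args.file, args.output)
```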
Evaluation
The model reaches a label-averaged F1 score of about 0.969 in spacing mode and 0.981 in correction mode:
Spacing Mode Evaluation
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
Correction Mode Evaluation
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
Note that correction mode scores higher on every label by combining model predictions with existing text spacing; the largest gain is on label 2, where F1 rises from 0.923 to 0.952.
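The reported numbers are consistent with F1 being the harmonic mean of precision and recall, and with the Average row being an unweighted (macro) mean over the three labels. The short check below reproduces this from the spacing-mode values; the helper names `f1_score` and `macro_average` are illustrative and not part of the repository.

```python
# (precision, recall, f1) per label, copied from the spacing-mode table
spacing = {
    0: (0.994663, 0.997324, 0.995992),
    1: (0.989546, 0.987828, 0.988686),
    2: (0.913413, 0.932125, 0.922674),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean, giving every label the same weight."""
    values = list(values)
    return sum(values) / len(values)


for label, (p, r, f1) in spacing.items():
    # Table values are rounded to six decimals, so allow a small tolerance
    assert abs(f1_score(p, r) - f1) < 1e-5, label

macro_f1 = macro_average(f1 for _, _, f1 in spacing.values())
print(round(macro_f1, 6))  # 0.969117, the F1 value in the Average row
```

Because the average is unweighted, the rarer ZWNJ label (2) pulls it down more than its frequency in running text would suggest.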
Label Meaning
- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: ZWNJ (zero-width non-joiner, U+200C) needed
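To make the label scheme concrete, the sketch below converts a correctly spaced string into characters plus labels and back again. It assumes that each label describes the gap after its character and that leading spacing characters are dropped; the repository's Labeler may align labels differently. The function names `encode` and `decode` are hypothetical, while the whitespace regex is the one passed to Labeler in Basic Usage.

```python
import re

SPACE_RE = re.compile(r"[^\S\r\n\v\f]")  # horizontal whitespace only
ZWNJ = "\u200c"
LABEL_TO_CHAR = {0: "", 1: " ", 2: ZWNJ}


def encode(text):
    """Strip spacing characters and record them as labels on the preceding character."""
    chars, labels = [], []
    for ch in text:
        if ch == ZWNJ:
            label = 2
        elif SPACE_RE.fullmatch(ch):
            label = 1
        else:
            chars.append(ch)
            labels.append(0)
            continue
        if labels:  # spacing before the first character is dropped
            labels[-1] = label
    return chars, labels


def decode(chars, labels):
    """Re-insert the spacing character named by each label after its character."""
    return "".join(ch + LABEL_TO_CHAR[label] for ch, label in zip(chars, labels))


text = "می\u200cروم به خانه"  # "I am going home", with a ZWNJ after the prefix
chars, labels = encode(text)
print(labels)
print(decode(chars, labels) == text)
```

In spacing mode the labels would come from the model instead of from `encode`, and `decode` would produce the output text.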
Use Cases
This model is particularly useful for:
- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis
Training
The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.
Training hyperparameters:
- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]
Limitations
- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on devices with CUDA
- [ANY OTHER LIMITATIONS]
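For the long-text limitation above, one simple approach is to split the input at whitespace into bounded chunks and process each chunk separately. The helper below is a minimal sketch: the name `chunk_text` and the 512-character default are arbitrary illustrative choices, not values taken from the model or its tokenizer.

```python
def chunk_text(text, max_chars=512):
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    Runs of whitespace are collapsed to one space, so exact original
    whitespace is not preserved.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("aa bb cc dd", max_chars=5))
```

Each chunk could then be passed to `process_text` from Basic Usage and the results joined with a space. Note that `str.split()` does not split on ZWNJ, so words joined by a ZWNJ stay in the same chunk.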
Citation
[CITATION_INFO]
License
[LICENSE_INFO]
Contact
For questions or feedback, please contact [CONTACT_INFO].