# BertForTokenClassificationWithFourO

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/USERNAME/MODEL_NAME)

A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

## Model Description

This model is built on a BERT architecture with a custom token classification head called FourOClassifier. It is designed to insert or correct spacing characters (regular spaces and zero-width non-joiners) in Persian text.

### Task

The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:

- **Spacing Mode**: Uses pure model predictions to insert spaces
- **Correction Mode**: Combines model predictions with existing spacing in the text

### Model Architecture

The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:

- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
import torch
from transformers import AutoTokenizer

# Repository-local modules that ship alongside this README
from labeler import Labeler
from modeling_custom import BertForTokenClassificationWithFourO
from run import ModelPipeline

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

# Initialize labeler: tag 1 = regular space, tag 2 = ZWNJ (U+200C)
labeler = Labeler(
    tags=(1, 2),
    regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
    chars=(" ", "\u200c"),
    class_count=2,
)

# Process text
def process_text(text, mode="space"):
    # Create a pipeline for processing
    pipeline = ModelPipeline(model_path)
    return pipeline.process_text(text, mode)

# Example (the sample reads: "This is a sample Persian text without proper spacing")
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```

### Command-line Usage

You can
also use the provided command-line interface:

```bash
# The quoted placeholder means "your Persian text here"
python run.py --text "متن فارسی شما در اینجا" --mode space
```

Or process a file:

```bash
python run.py --file input.txt --output result.txt --mode correct
```

The repository includes a sample `input.txt` file that you can use to test the model.

## Parameters

- `mode`:
  - `space`: Uses model predictions to add spaces
  - `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)

## Evaluation

The model reaches a macro-averaged F1 score of 0.969 in spacing mode and 0.981 in correction mode. Per-label results for both modes are listed below.

### Spacing Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

### Correction Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

Note that the correction mode achieves slightly better results by combining model
predictions with existing text spacing.

### Label Meaning

- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: Zero-width non-joiner (ZWNJ, U+200C) needed

## Use Cases

This model is particularly useful for:

- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis

## Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:

- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]

## Limitations

- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on devices with CUDA
- [ANY OTHER LIMITATIONS]

## Citation

```
[CITATION_INFO]
```

## License

[LICENSE_INFO]

## Contact

For questions or feedback, please contact [CONTACT_INFO].
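
## Example: Applying Predicted Labels

The label scheme listed under Label Meaning (0 = nothing, 1 = regular space, 2 = ZWNJ) can be illustrated with a small, model-free sketch. This is not the repository's implementation: the helper name `apply_labels` is hypothetical, and the sketch assumes that each label describes what should follow its character, which the repository does not confirm. The labels in the example are written by hand rather than predicted.

```python
# Label scheme from the "Label Meaning" section of this README.
SPACING_CHARS = {0: "", 1: " ", 2: "\u200c"}


def apply_labels(chars, labels):
    """Rebuild text by appending each label's spacing character.

    Assumption (not confirmed by the repository): labels[i] describes
    what should be inserted immediately after chars[i].
    """
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(ch + SPACING_CHARS[label] for ch, label in zip(chars, labels))


# Hand-labelled example: a five-letter word that needs a ZWNJ after
# its second letter. The labels are illustrative, not model output.
chars = list("میروم")
labels = [0, 2, 0, 0, 0]
restored = apply_labels(chars, labels)
print(repr(restored))
```

In practice the labels would come from the model's per-token predictions, and correction mode would additionally take the spacing already present in the input into account.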