# BertForTokenClassificationWithFourO

A specialized token classification model built on BERT, with a custom classifier head for Persian text spacing and formatting tasks.

## Model Description

This model uses a BERT architecture with a custom token classification head called FourOClassifier. It is designed to correct or insert spacing characters in Persian text.

### Task

The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:

- **Spacing Mode**: Uses only the model's predictions to insert spaces
- **Correction Mode**: Combines the model's predictions with the spacing already present in the text
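
The difference between the two modes can be sketched on plain label sequences. This is only an illustration: the merge rule below (keep a separator that already exists in the input, otherwise use the model's label) is an assumption, not the repository's confirmed logic, and the function names are hypothetical. Labels follow this card's scheme: 0 for no separator, 1 for a space, 2 for a ZWNJ.

```python
NO_SPACE, SPACE, ZWNJ = 0, 1, 2


def spacing_mode(predicted):
    """Spacing mode: use the model's label at every position."""
    return list(predicted)


def correction_mode(predicted, existing):
    """Correction mode under an ASSUMED rule: keep separators already
    present in the input, fall back to the model where there is none."""
    return [e if e != NO_SPACE else p for p, e in zip(predicted, existing)]


predicted = [0, 1, 0, 2, 0]  # labels from the model
existing = [0, 0, 0, 1, 0]   # labels read off the input text

print(spacing_mode(predicted))               # [0, 1, 0, 2, 0]
print(correction_mode(predicted, existing))  # [0, 1, 0, 1, 0]
```

Other merge rules are equally plausible, for example trusting the model only above a confidence threshold; the point is that correction mode has a second label sequence to draw on.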

### Model Architecture

The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:

- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
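
A head with these four components could look roughly like the PyTorch sketch below. The class name, layer order, layer sizes and dropout rate are all assumptions made for illustration; the actual `FourOClassifier` in `modeling_custom` may differ. Three output labels are used because the evaluation reports labels 0, 1 and 2.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical head: dense + ReLU -> dropout -> batch norm -> projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects features on axis 1, so merge the batch and
        # sequence axes into one and restore the shape afterwards.
        x = self.norm(x.reshape(batch * seq_len, -1)).reshape(batch, seq_len, -1)
        return self.out_proj(x)  # (batch, seq_len, num_labels)


head = FourOClassifierSketch().eval()
with torch.no_grad():
    logits = head(torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 16, 3])
```

Taking `logits.argmax(dim=-1)` would then give one label per token.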

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
import torch
from transformers import AutoTokenizer

# Local modules, expected alongside run.py in this repository
from labeler import Labeler
from modeling_custom import BertForTokenClassificationWithFourO
from run import ModelPipeline

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()  # inference mode: disables dropout

# Initialize labeler: tag 1 is a regular space, tag 2 is ZWNJ (U+200C)
labeler = Labeler(tags=(1, 2),
                  regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
                  chars=(" ", "\u200c"),
                  class_count=2)

# Build the pipeline once so it is not re-created on every call
pipeline = ModelPipeline(model_path)


def process_text(text, mode="space"):
    """Run the pipeline on text; mode is "space" or "correct"."""
    return pipeline.process_text(text, mode)


# Example ("This sample Persian text lacks proper spacing")
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```

### Command-line Usage

You can also use the provided command-line interface:

```bash
python run.py --text "متن فارسی شما در اینجا" --mode space
```

Or process a file:

```bash
python run.py --file input.txt --output result.txt --mode correct
```

The repository includes a sample `input.txt` file that you can use to test the model.

## Parameters

- `mode`:
  - `space`: Uses model predictions to add spaces
  - `correct`: Combines model predictions with the original text spacing (recommended for texts that already have some correct spacing)
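
## Evaluation

The tables below report per-label precision, recall, accuracy and F1 score for both operating modes. Averaged over the three labels, the F1 score is 0.969 in spacing mode and 0.981 in correction mode.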

### Spacing Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

### Correction Mode Evaluation

```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```

Note that correction mode scores slightly higher on every label by combining model predictions with the existing text spacing. The largest gain is on label 2, where the F1 score rises from 0.923 to 0.952.
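
As a consistency check on the reported numbers, the F1 column can be recomputed from precision and recall, and the Average row matches the unweighted mean over the three labels. The sketch below copies the precision and recall values from the two tables; the variable and function names are illustrative only.

```python
# (precision, recall) per label, copied from the evaluation tables
SPACING = {0: (0.994663, 0.997324), 1: (0.989546, 0.987828), 2: (0.913413, 0.932125)}
CORRECTION = {0: (0.995932, 0.998386), 1: (0.992917, 0.992227), 2: (0.944612, 0.959428)}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_label):
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1(p, r) for p, r in per_label.values()]
    return sum(scores) / len(scores)


for name, table in (("spacing", SPACING), ("correction", CORRECTION)):
    print(name, round(macro_f1(table), 4))
```

Because the inputs are rounded to six decimals, the recomputed values agree with the tables only up to rounding error.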

### Label Meaning

- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: ZWNJ (zero-width non-joiner, U+200C) needed
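
To make the label scheme concrete, the sketch below converts correctly spaced text into one label per character and back. It assumes each label describes the separator that follows its character; that convention and the helper names are assumptions for illustration, not the behavior of the repository's `Labeler`.

```python
SEPARATOR_TO_LABEL = {" ": 1, "\u200c": 2}  # 1 = space, 2 = ZWNJ
LABEL_TO_SEPARATOR = {1: " ", 2: "\u200c"}


def text_to_labels(text):
    """Strip separators; return the bare characters and one label each."""
    chars, labels = [], []
    for ch in text:
        if ch in SEPARATOR_TO_LABEL:
            if labels:  # a leading separator has no character to attach to
                labels[-1] = SEPARATOR_TO_LABEL[ch]
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def labels_to_text(chars, labels):
    """Re-insert the separator named by each label after its character."""
    return "".join(ch + LABEL_TO_SEPARATOR.get(label, "")
                   for ch, label in zip(chars, labels))


spaced = "می\u200cروم خانه"  # ZWNJ inside the first word, then a space
chars, labels = text_to_labels(spaced)
print(labels)  # [0, 2, 0, 0, 1, 0, 0, 0, 0]
print(labels_to_text(chars, labels) == spaced)  # True
```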

## Use Cases

This model is particularly useful for:

- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis

## Training

The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

Training hyperparameters:

- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]

## Limitations

- The model is designed specifically for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on CUDA-capable devices
- [ANY OTHER LIMITATIONS]
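
For the chunking recommendation above, one simple approach is to split on line breaks and cut any remaining long line into fixed-size pieces. This is a minimal sketch: the `chunk_text` helper and the 512-character default are assumptions, and a character budget is only a rough stand-in for the model's token limit.

```python
def chunk_text(text, max_chars=512):
    """Yield pieces of at most max_chars characters, preferring line breaks."""
    for line in text.splitlines(keepends=True):
        for start in range(0, len(line), max_chars):
            yield line[start:start + max_chars]


sample = "a" * 1000 + "\n" + "b" * 10
chunks = list(chunk_text(sample))
print([len(c) for c in chunks])    # [512, 489, 10]
print("".join(chunks) == sample)   # True
```

Each chunk can then be passed to the model separately and the outputs concatenated. A hard cut may fall inside a word, so predictions near chunk boundaries deserve extra scrutiny.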

## Citation

```
[CITATION_INFO]
```

## License

[LICENSE_INFO]

## Contact

For questions or feedback, please contact [CONTACT_INFO].