# BertForTokenClassificationWithFourO
[Model on Hugging Face](https://huggingface.co/USERNAME/MODEL_NAME)
A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.
## Model Description
This model is built on a BERT architecture with a custom token classification head called FourOClassifier. It's specifically designed for processing Persian text to correct or add proper spacing characters.
### Task
The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
- **Spacing Mode**: Uses pure model predictions to insert spaces
- **Correction Mode**: Combines model predictions with existing spacing in the text
### Model Architecture
The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:
- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
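The actual head lives in `modeling_custom.py`; as a rough sketch of a classifier with these components (layer sizes, ordering, and dropout rate are illustrative assumptions, not the real `FourOClassifier`):

```python
import torch
import torch.nn as nn

class FourOClassifierSketch(nn.Module):
    """Illustrative token-classification head: Dense -> ReLU -> Dropout ->
    BatchNorm -> output projection. Sizes are assumptions; see
    modeling_custom.py for the actual FourOClassifier."""
    def __init__(self, hidden_size=768, intermediate_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(intermediate_size)
        self.out_proj = nn.Linear(intermediate_size, num_labels)

    def forward(self, hidden_states):  # (batch, seq_len, hidden_size)
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects the feature dim second, so transpose around it
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        return self.out_proj(x)  # (batch, seq_len, num_labels)

logits = FourOClassifierSketch()(torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 16, 3])
```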
## Usage
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import AutoTokenizer
from modeling_custom import BertForTokenClassificationWithFourO
from labeler import Labeler
import torch
# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()
# Initialize labeler
labeler = Labeler(
    tags=(1, 2),
    regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
    chars=(" ", "\u200c"),
    class_count=2,
)
# Process text
# Process text via the repo's processing pipeline (defined in run.py)
from run import ModelPipeline

pipeline = ModelPipeline(model_path)

def process_text(text, mode="space"):
    # mode is either "space" (pure model predictions) or "correct"
    return pipeline.process_text(text, mode)
# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```
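The two regexes in the `Labeler` determine which characters count as spacing: label 1 for horizontal whitespace and label 2 for the ZWNJ (`\u200c`). A hypothetical illustration of how text could be reduced to characters plus labels under that scheme (this helper is not the repo's actual `Labeler` code):

```python
import re

# Mirrors the Labeler configuration above: label 1 = space, label 2 = ZWNJ.
SPACING = [(1, re.compile(r'[^\S\r\n\v\f]')), (2, re.compile('\u200c'))]

def chars_and_labels(text):
    """Return the text stripped of spacing characters plus one label per kept
    character: the label records which spacing character (if any) followed it."""
    chars, labels = [], []
    for ch in text:
        tag = next((t for t, rx in SPACING if rx.match(ch)), 0)
        if tag and labels:
            labels[-1] = tag       # spacing char annotates the previous kept char
        elif not tag:
            chars.append(ch)
            labels.append(0)
    return ''.join(chars), labels

print(chars_and_labels('می\u200cرود'))  # ('میرود', [0, 2, 0, 0, 0])
```

Note that `\u200c` is not matched by `\s` in Python (ZWNJ is a format character, not whitespace), so the two regexes never overlap.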
### Command-line Usage
You can also use the provided command-line interface:
```bash
python run.py --text "متن فارسی شما در اینجا" --mode space
```
Or process a file:
```bash
python run.py --file input.txt --output result.txt --mode correct
```
The repository includes a sample `input.txt` file that you can use to test the model.
## Parameters
- `mode`:
- `space`: Uses model predictions to add spaces
- `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)
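`run.py` implements the actual combination; one plausible sketch of the difference between the modes, assuming per-character labels where 0 means no spacing (the merge rule here is an illustrative guess, not the repo's code):

```python
def merge_labels(model_labels, original_labels):
    """Correction mode (illustrative guess): trust the model's prediction, but
    keep spacing that already exists in the input wherever the model predicts
    none. Spacing mode would use model_labels unchanged."""
    return [orig if pred == 0 else pred
            for pred, orig in zip(model_labels, original_labels)]

print(merge_labels([0, 1, 0, 2], [1, 0, 0, 2]))  # [1, 1, 0, 2]
```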
## Evaluation
The model achieves excellent performance in both operating modes:
### Spacing Mode Evaluation
| Label   | Precision | Recall   | Accuracy | F1 Score |
|---------|-----------|----------|----------|----------|
| 0       | 0.994663  | 0.997324 | 0.997324 | 0.995992 |
| 1       | 0.989546  | 0.987828 | 0.987828 | 0.988686 |
| 2       | 0.913413  | 0.932125 | 0.932125 | 0.922674 |
| Average | 0.965874  | 0.972426 | 0.972426 | 0.969117 |
### Correction Mode Evaluation
| Label   | Precision | Recall   | Accuracy | F1 Score |
|---------|-----------|----------|----------|----------|
| 0       | 0.995932  | 0.998386 | 0.998386 | 0.997157 |
| 1       | 0.992917  | 0.992227 | 0.992227 | 0.992572 |
| 2       | 0.944612  | 0.959428 | 0.959428 | 0.951962 |
| Average | 0.977820  | 0.983347 | 0.983347 | 0.980564 |
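In both tables the Average row is the unweighted (macro) mean of the three per-label scores; for spacing mode this checks out directly:

```python
# Verify that each spacing-mode Average entry is the macro mean of its column.
scores = {
    "precision": ([0.994663, 0.989546, 0.913413], 0.965874),
    "recall":    ([0.997324, 0.987828, 0.932125], 0.972426),
    "f1":        ([0.995992, 0.988686, 0.922674], 0.969117),
}
for name, (per_label, reported) in scores.items():
    macro = sum(per_label) / len(per_label)
    assert abs(macro - reported) < 1e-6, name
    print(f"{name}: {macro:.6f}")
```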
Note that the correction mode achieves slightly better results by combining model predictions with existing text spacing.
### Label Meaning
- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: ZWNJ character (U+200C, zero-width non-joiner) needed
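Going the other way, applying a predicted label sequence back to bare characters is straightforward (hypothetical helper for illustration, not part of the repo):

```python
def apply_labels(chars, labels):
    """Insert the spacing character each label calls for after its character:
    0 -> nothing, 1 -> space, 2 -> ZWNJ (U+200C)."""
    spacing = {0: '', 1: ' ', 2: '\u200c'}
    return ''.join(ch + spacing[label] for ch, label in zip(chars, labels))

print(apply_labels('میرود', [0, 2, 0, 0, 0]))  # 'می\u200cرود'
```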
## Use Cases
This model is particularly useful for:
- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis
## Training
The model was trained on [DATASET_NAME], a corpus of Persian text with proper spacing annotations.
Training hyperparameters:
- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]
## Limitations
- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on CUDA-capable devices
- [ANY OTHER LIMITATIONS]
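For the chunking recommendation above, a simple approach is to split on existing line breaks and fall back to fixed-size character windows (the 400-character window is an arbitrary illustration; the repo may already handle long inputs internally):

```python
def chunk_text(text, max_chars=400):
    """Split text into pieces no longer than max_chars, preferring newline
    boundaries so chunks stay coherent. Illustrative only."""
    chunks = []
    for line in text.splitlines():
        while len(line) > max_chars:
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        if line:
            chunks.append(line)
    return chunks

parts = chunk_text('x' * 900 + '\nshort line')
print([len(p) for p in parts])  # [400, 400, 100, 10]
```

Each chunk can then be passed through `process_text` independently and the results concatenated.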
## Citation
```
[CITATION_INFO]
```
## License
[LICENSE_INFO]
## Contact
For questions or feedback, please contact [CONTACT_INFO].