# BertForTokenClassificationWithFourO
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/USERNAME/MODEL_NAME)
A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.
## Model Description
This model pairs a BERT encoder with a custom token classification head called `FourOClassifier`. It is designed to insert or correct spacing characters in Persian text.
### Task
The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
- **Spacing Mode**: Uses pure model predictions to insert spaces
- **Correction Mode**: Combines model predictions with existing spacing in the text
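The difference between the two modes can be illustrated with a small sketch. The merge rule below is an assumption, not the repository's confirmed logic: it keeps any spacing label already present in the input and falls back to the model's prediction elsewhere. The function name `combine_labels` is hypothetical.

```python
def combine_labels(predicted, existing):
    """Merge per-position labels (assumed rule, not taken from the repository).

    A non-zero label in `existing` means the input text already had a spacing
    character at that position, so it is kept. Otherwise the model's label wins.
    """
    return [old if old != 0 else new for new, old in zip(predicted, existing)]


predicted = [0, 1, 0, 2, 0]   # model output (spacing mode would use this directly)
existing = [0, 0, 0, 1, 0]    # labels derived from the spacing in the input text
print(combine_labels(predicted, existing))  # [0, 1, 0, 1, 0]
```

Other merge rules are possible, for example trusting the model whenever it disagrees with high confidence.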
### Model Architecture
The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:
- Dense layer with ReLU activation
- Dropout for regularization
- Batch normalization
- Output projection layer
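As a rough illustration, a head with these four components could look like the PyTorch sketch below. The layer order, the sizes (`hidden_size=768`, `inner_size=256`), the dropout rate, and the class name `FourOClassifierSketch` are all assumptions; the actual `FourOClassifier` in `modeling_custom.py` may differ. `num_labels=3` follows the three labels (0, 1, 2) reported in the evaluation tables.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Hypothetical head: dense -> ReLU -> dropout -> batch norm -> projection."""

    def __init__(self, hidden_size=768, inner_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, inner_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(inner_size)
        self.out_proj = nn.Linear(inner_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states has shape (batch, seq_len, hidden_size).
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d normalizes over the feature axis of a 2D input,
        # so tokens are flattened into one axis and restored afterwards.
        x = self.norm(x.reshape(batch * seq_len, -1)).reshape(batch, seq_len, -1)
        return self.out_proj(x)  # (batch, seq_len, num_labels)


head = FourOClassifierSketch().eval()
logits = head(torch.randn(2, 16, 768))
print(logits.shape)  # torch.Size([2, 16, 3])
```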
## Usage
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
import torch
from transformers import AutoTokenizer

from labeler import Labeler
from modeling_custom import BertForTokenClassificationWithFourO
from run import ModelPipeline

# Load model and tokenizer
model_path = "USERNAME/MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
model.eval()

# Initialize labeler: tag 1 marks a regular space, tag 2 a ZWNJ (U+200C)
labeler = Labeler(
    tags=(1, 2),
    regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
    chars=(" ", "\u200c"),
    class_count=2,
)


def process_text(text, mode="space"):
    """Run the repository's pipeline on `text` in the given mode ("space" or "correct")."""
    pipeline = ModelPipeline(model_path)
    return pipeline.process_text(text, mode)


# Example
text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
result = process_text(text, mode="space")
print(result)
```
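To show what the two regular expressions passed to `Labeler` match, here is a minimal standalone sketch that turns spaced text into characters plus labels. It does not use the repository's `Labeler` class; the function `text_to_labels` and the choice to attach each tag to the preceding character are assumptions made for illustration.

```python
import re

SPACE_RE = re.compile(r'[^\S\r\n\v\f]')  # horizontal whitespace (pattern from the README)
ZWNJ_RE = re.compile(r'\u200c')          # zero-width non-joiner (pattern from the README)


def text_to_labels(text):
    """Strip spacing characters and return (chars, labels), one label per kept character."""
    chars, labels = [], []
    for ch in text:
        if SPACE_RE.fullmatch(ch):
            tag = 1
        elif ZWNJ_RE.fullmatch(ch):
            tag = 2
        else:
            chars.append(ch)
            labels.append(0)
            continue
        if labels:  # assumed alignment: the tag belongs to the preceding character
            labels[-1] = tag
    return "".join(chars), labels


chars, labels = text_to_labels("می\u200cروم خانه")
print(labels)  # [0, 2, 0, 0, 1, 0, 0, 0, 0]
```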
### Command-line Usage
You can also use the provided command-line interface:
```bash
python run.py --text "متن فارسی شما در اینجا" --mode space
```
Or process a file:
```bash
python run.py --file input.txt --output result.txt --mode correct
```
The repository includes a sample `input.txt` file that you can use to test the model.
## Parameters
- `mode`:
  - `space`: Uses model predictions to add spaces
  - `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)
## Evaluation
Macro-averaged F1 is 0.969 in spacing mode and 0.981 in correction mode. Per-label results for each mode follow:
### Spacing Mode Evaluation
```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```
### Correction Mode Evaluation
```
╒═════════╤═════════════╤══════════╤════════════╤════════════╕
│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
╞═════════╪═════════════╪══════════╪════════════╪════════════╡
│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
├─────────┼─────────────┼──────────┼────────────┼────────────┤
│ Average │     0.97782 │ 0.983347 │   0.983347 │   0.980564 │
╘═════════╧═════════════╧══════════╧════════════╧════════════╛
```
Correction mode scores higher on every label because it combines model predictions with the spacing already present in the text. The largest gain is on label 2 (ZWNJ), where F1 rises from 0.923 to 0.952.
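The F1 column in these tables is the harmonic mean of precision and recall, and the Average row matches the unweighted (macro) mean over the three labels. The short check below recomputes both from the spacing mode figures; the helper name `f1_score` is only illustrative and is not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per label, copied from the spacing mode table.
spacing_mode = {
    0: (0.994663, 0.997324),
    1: (0.989546, 0.987828),
    2: (0.913413, 0.932125),
}

per_label_f1 = {label: f1_score(p, r) for label, (p, r) in spacing_mode.items()}
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

for label, value in per_label_f1.items():
    print(label, round(value, 6))
print("macro", round(macro_f1, 6))
```

Because the table values are rounded to six decimals, recomputed figures can differ from the table in the last digit.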
### Label Meaning
- Label 0: No spacing needed
- Label 1: Regular space character needed
- Label 2: ZWNJ character (zero-width non-joiner, U+200C) needed
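A minimal sketch of how a label sequence could be turned back into text is shown below. It assumes one label per character and that each label describes the gap following that character; the README does not state the alignment, so treat this and the helper name `apply_labels` as illustrative only.

```python
# Character to insert for each label, following the list above.
SPACING_CHARS = {0: "", 1: " ", 2: "\u200c"}


def apply_labels(chars, labels):
    """Insert the spacing character for each label after its character (assumed alignment)."""
    return "".join(ch + SPACING_CHARS[label] for ch, label in zip(chars, labels))


restored = apply_labels("میرومخانه", [0, 2, 0, 0, 1, 0, 0, 0, 0])
print(restored == "می\u200cروم خانه")  # True
```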
## Use Cases
This model is particularly useful for:
- Correcting Persian text with improper spacing
- Normalizing text from different sources
- Improving text readability for downstream NLP tasks
- Preprocessing Persian text for search engines or text analysis
## Training
The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.
Training hyperparameters:
- Learning rate: [VALUE]
- Batch size: [VALUE]
- Training steps: [VALUE]
- [OTHER PARAMETERS]
## Limitations
- The model is specifically designed for Persian text
- Performance may vary on specialized domains or technical texts
- Very long texts should be processed in chunks for optimal performance
- Tuned for execution on devices with CUDA
- [ANY OTHER LIMITATIONS]
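For the chunking point above, one simple approach is to split the input at line boundaries before sending it to the model. The sketch below is an assumption about how that could be done, not code from this repository. The model's real limit is counted in tokens, so the `max_chars=512` default is only a placeholder, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, max_chars=512):
    """Split text into chunks of at most max_chars, breaking at line ends when possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut into fixed-size pieces.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when the next line would overflow the current one.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n" + "b" * 25 + "\n" + "c" * 3
pieces = chunk_text(sample, max_chars=10)
print(all(len(piece) <= 10 for piece in pieces))  # True
print("".join(pieces) == sample)                  # True
```

Each chunk could then be processed separately and the results concatenated. Cutting in the middle of a long line may hurt predictions near the cut, so splitting at sentence boundaries is preferable when they are available.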
## Citation
```
[CITATION_INFO]
```
## License
[LICENSE_INFO]
## Contact
For questions or feedback, please contact [CONTACT_INFO].