	# BertForTokenClassificationWithFourO

	[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/USERNAME/MODEL_NAME)

	A specialized token classification model built on BERT with a custom classifier for Persian text spacing and formatting tasks.

	## Model Description

	This model is built on a BERT architecture with a custom token classification head called FourOClassifier. It is designed to add or correct spacing characters, both regular spaces and zero-width non-joiners (ZWNJ), in Persian text.

	### Task

	The model performs token classification to detect where spacing characters should be inserted in Persian text. It can operate in two modes:
	- Spacing Mode: Uses pure model predictions to insert spaces
	- Correction Mode: Combines model predictions with existing spacing in the text
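
The difference between the two modes can be illustrated with a small sketch. It assumes one label per character, using the convention from the Label Meaning section (0 = none, 1 = space, 2 = ZWNJ). The merge rule shown for correction mode (spacing already present in the input takes precedence over the model) is an assumption for illustration and may differ from what `run.py` does; `resolve_labels` is a hypothetical helper.

```python
def resolve_labels(predicted, existing, mode="space"):
    """Return the final label per character for the given mode.

    predicted: labels from the model (0 = none, 1 = space, 2 = ZWNJ).
    existing:  labels read from spacing already present in the input.
    """
    if mode == "space":
        # Spacing mode: use the model's predictions alone.
        return list(predicted)
    if mode == "correct":
        # Correction mode (assumed rule): keep spacing that is already in the
        # text and fall back to the model where the input has none.
        return [e if e != 0 else p for p, e in zip(predicted, existing)]
    raise ValueError(f"unknown mode: {mode!r}")


predicted = [0, 2, 0, 0, 1]
existing = [0, 0, 0, 1, 0]
space_labels = resolve_labels(predicted, existing, mode="space")
correct_labels = resolve_labels(predicted, existing, mode="correct")
print(space_labels)
print(correct_labels)
```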

	### Model Architecture

	The model is based on the BERT architecture with a custom classifier head (FourOClassifier) that includes:
	- Dense layer with ReLU activation
	- Dropout for regularization
	- Batch normalization
	- Output projection layer
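
A head with these four components could look roughly like the PyTorch sketch below. This is not the code in `modeling_custom.py`: the class name `FourOClassifierSketch`, the layer order, the layer sizes, and the dropout rate are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class FourOClassifierSketch(nn.Module):
    """Illustrative token classification head (sizes are assumed, not the model's)."""

    def __init__(self, hidden_size=768, intermediate_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, intermediate_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.BatchNorm1d(intermediate_size)
        self.out_proj = nn.Linear(intermediate_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        batch, seq_len, _ = hidden_states.shape
        x = self.dropout(self.activation(self.dense(hidden_states)))
        # BatchNorm1d expects (N, C), so fold the token axis into the batch axis.
        x = self.norm(x.reshape(batch * seq_len, -1))
        # One score per label for every token: (batch, seq_len, num_labels)
        return self.out_proj(x).reshape(batch, seq_len, -1)


head = FourOClassifierSketch()
head.eval()  # use running statistics in batch norm, disable dropout
logits = head(torch.randn(2, 5, 768))
print(logits.shape)
```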

	## Usage

	### Installation

	```bash
	pip install transformers torch
	```

	### Basic Usage

	```python
	from transformers import AutoTokenizer

	# Local modules (expected alongside run.py in this repository)
	from labeler import Labeler
	from modeling_custom import BertForTokenClassificationWithFourO
	from run import ModelPipeline

	# Load model and tokenizer
	model_path = "USERNAME/MODEL_NAME"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = BertForTokenClassificationWithFourO.from_pretrained(model_path)
	model.eval()  # inference mode: disables dropout

	# Initialize labeler: tag 1 = horizontal whitespace, tag 2 = ZWNJ (U+200C)
	labeler = Labeler(
	    tags=(1, 2),
	    regexes=(r'[^\S\r\n\v\f]', r'\u200c'),
	    chars=(" ", "\u200c"),
	    class_count=2,
	)

	# Process text
	def process_text(text, mode="space"):
	    # Create a pipeline for processing
	    pipeline = ModelPipeline(model_path)
	    result = pipeline.process_text(text, mode)
	    return result

	# Example
	text = "این متن نمونه فارسی بدون فاصله گذاری مناسب است"
	result = process_text(text, mode="space")
	print(result)
	```
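
The two regular expressions passed to `Labeler` describe the spacing characters: `[^\S\r\n\v\f]` matches any whitespace except carriage return, newline, vertical tab, and form feed, and `\u200c` matches ZWNJ. As a rough illustration of how such patterns can turn already-spaced text into bare characters plus labels, consider the sketch below. `extract_labels` is a hypothetical function, not the repository's `Labeler`, and it assumes that a label is attached to the character immediately before the spacing character.

```python
import re

SPACE_RE = re.compile(r"[^\S\r\n\v\f]")  # horizontal whitespace -> tag 1
ZWNJ_RE = re.compile("\u200c")           # zero-width non-joiner -> tag 2


def extract_labels(text):
    """Strip spacing characters and record what followed each kept character."""
    chars, labels = [], []
    for ch in text:
        if SPACE_RE.fullmatch(ch):
            tag = 1
        elif ZWNJ_RE.fullmatch(ch):
            tag = 2
        else:
            chars.append(ch)
            labels.append(0)
            continue
        # Attach the tag to the previous kept character; leading spacing is
        # dropped, and in a run of spacing characters the last one wins.
        if labels:
            labels[-1] = tag
    return chars, labels


chars, labels = extract_labels("می\u200cروم به")
print("".join(chars), labels)
```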

	### Command-line Usage

	You can also use the provided command-line interface:

	```bash
	python run.py --text "متن فارسی شما در اینجا" --mode space
	```

	Or process a file:

	```bash
	python run.py --file input.txt --output result.txt --mode correct
	```

	The repository includes a sample `input.txt` file that you can use to test the model.

	## Parameters

	- `mode`:
	  - `space`: Uses model predictions to add spaces
	  - `correct`: Combines model predictions with original text spacing (recommended for texts with some correct spacing)

	## Evaluation

	The tables below report per-label precision, recall, accuracy, and F1 score for both operating modes. The Average row is the unweighted mean over the three labels.

	### Spacing Mode Evaluation

	```
	╒═════════╤═════════════╤══════════╤════════════╤════════════╕
	│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
	╞═════════╪═════════════╪══════════╪════════════╪════════════╡
	│ 0       │    0.994663 │ 0.997324 │   0.997324 │   0.995992 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ 1       │    0.989546 │ 0.987828 │   0.987828 │   0.988686 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ 2       │    0.913413 │ 0.932125 │   0.932125 │   0.922674 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ Average │    0.965874 │ 0.972426 │   0.972426 │   0.969117 │
	╘═════════╧═════════════╧══════════╧════════════╧════════════╛
	```

	### Correction Mode Evaluation

	```
	╒═════════╤═════════════╤══════════╤════════════╤════════════╕
	│ Label   │   Precision │   Recall │   Accuracy │   F1 Score │
	╞═════════╪═════════════╪══════════╪════════════╪════════════╡
	│ 0       │    0.995932 │ 0.998386 │   0.998386 │   0.997157 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ 1       │    0.992917 │ 0.992227 │   0.992227 │   0.992572 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ 2       │    0.944612 │ 0.959428 │   0.959428 │   0.951962 │
	├─────────┼─────────────┼──────────┼────────────┼────────────┤
	│ Average │    0.97782  │ 0.983347 │   0.983347 │   0.980564 │
	╘═════════╧═════════════╧══════════╧════════════╧════════════╛
	```

	Correction mode scores slightly higher on every label by combining model predictions with existing text spacing: average F1 rises from 0.969 to 0.981, with the largest gain on label 2 (0.923 to 0.952).
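
For readers who want to compute comparable figures on their own data, the sketch below derives per-label precision, recall, and F1 from two label sequences and averages them without weighting. It assumes this is how the reported numbers were produced, which the tables suggest but the source does not state; `per_label_metrics` is a hypothetical helper and the toy labels are made up.

```python
def per_label_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Per-label precision/recall/F1 plus their unweighted (macro) average."""
    pairs = list(zip(y_true, y_pred))
    rows = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows[label] = {"precision": precision, "recall": recall, "f1": f1}
    # Macro average: every label counts equally, regardless of frequency.
    rows["average"] = {
        key: sum(rows[label][key] for label in labels) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    return rows


y_true = [0, 0, 1, 1, 2, 2]  # toy gold labels
y_pred = [0, 1, 1, 1, 2, 0]  # toy predictions
metrics = per_label_metrics(y_true, y_pred)
for name, row in metrics.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

Because the average is unweighted, the rarer ZWNJ label influences the Average row as much as the far more frequent label 0.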

	### Label Meaning
	- Label 0: No spacing needed
	- Label 1: Regular space character needed
	- Label 2: Zero-width non-joiner (ZWNJ, U+200C) needed
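
Given one label per character, the output text can be rebuilt by appending the matching spacing character after each input character. The sketch below assumes that a label describes what follows its character; `apply_labels` and `LABEL_TO_CHAR` are hypothetical names, not part of the repository.

```python
# Label convention from the list above; "\u200c" is ZWNJ.
LABEL_TO_CHAR = {0: "", 1: " ", 2: "\u200c"}


def apply_labels(chars, labels):
    """Insert the spacing character selected by each label after its character."""
    return "".join(ch + LABEL_TO_CHAR[label] for ch, label in zip(chars, labels))


# "میروم" with a ZWNJ predicted after the second character.
spaced = apply_labels(list("میروم"), [0, 2, 0, 0, 0])
print(spaced)
```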

	## Use Cases

	This model is particularly useful for:
	- Correcting Persian text with improper spacing
	- Normalizing text from different sources
	- Improving text readability for downstream NLP tasks
	- Preprocessing Persian text for search engines or text analysis

	## Training

	The model was trained on [DATASET_NAME] of Persian text with proper spacing annotations.

	Training hyperparameters:
	- Learning rate: [VALUE]
	- Batch size: [VALUE]
	- Training steps: [VALUE]
	- [OTHER PARAMETERS]

	## Limitations

	- The model is specifically designed for Persian text
	- Performance may vary on specialized domains or technical texts
	- Very long texts should be processed in chunks for optimal performance
	- Tuned for execution on devices with CUDA
	- [ANY OTHER LIMITATIONS]
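
One simple way to process long texts in chunks is to split on existing spaces so that no chunk exceeds a character budget, run each chunk separately, and join the results. The sketch below shows only the splitting step. `chunk_text` is a hypothetical helper and `max_chars=256` is an arbitrary assumption; a suitable limit depends on the tokenizer and the model's maximum sequence length, neither of which is stated here.

```python
def chunk_text(text, max_chars=256):
    """Split text at single spaces into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("aa bb cc dd", max_chars=5)
print(chunks)
```

Each chunk can then be passed to the pipeline on its own and the outputs joined with a single space.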

	## Citation

	```
	[CITATION_INFO]
	```

	## License

	[LICENSE_INFO]

	## Contact

	For questions or feedback, please contact [CONTACT_INFO].