# DeBERTa-v3-xsmall for Sequence Classification
A fine-tuned DeBERTa-v3-xsmall model that classifies text sequences into seven categories of sensitive data.
## Model Description
This model achieves a **95.9% macro F1 score** categorizing text sequences into:
- **NAME_FIRST** - First names
- **NAME_LAST** - Last names
- **GEO_STREET_NAME** - Street names
- **GEO_CITY_NAME** - City names
- **PROFESSION_JOB_TITLE** - Job titles
- **PROFESSION_EMPLOYER** - Company/employer names
- **MEDICAL_ALLERGY** - Medical allergies
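The model outputs a class index; the mapping from index to label name is defined by the label encoder shipped with the repo (`label_encoder.pkl`, used in the Usage section below). A minimal sketch for inspecting that mapping, assuming the file is a pickled scikit-learn `LabelEncoder` as the joblib loading in the Usage example suggests:
```python
import joblib

# Load the label encoder distributed alongside the model weights
# (download label_encoder.pkl from this repo first).
label_encoder = joblib.load("label_encoder.pkl")

# classes_ lists label names in index order: position i is the model's class i.
print(list(label_encoder.classes_))
```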
## Performance
### F1 Scores by Category:
- **GEO_STREET_NAME**: 99.8% (9.8 points above the 90% target)
- **PROFESSION_JOB_TITLE**: 99.5% (9.5 points above target)
- **MEDICAL_ALLERGY**: 99.5% (9.5 points above target)
- **PROFESSION_EMPLOYER**: 98.7% (8.7 points above target)
- **GEO_CITY_NAME**: 97.2% (7.2 points above target)
- **NAME_FIRST**: 88.8% (1.2 points below target)
- **NAME_LAST**: 87.9% (2.1 points below target)
### Overall Performance:
- **Macro F1 Score**: 95.9%
- **Overall Accuracy**: 95.9%
- **Categories Meeting 90% Target**: 5/7
- **Processing Speed**: 29.5 cells/second
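The macro F1 score is the unweighted mean of the seven per-category scores above (which indeed averages to 95.9%). A minimal sketch of how to compute it with scikit-learn; the `y_true`/`y_pred` arrays here are placeholders, not the actual evaluation data:
```python
from sklearn.metrics import f1_score

# Placeholder labels and predictions; substitute real evaluation outputs.
y_true = ["NAME_FIRST", "GEO_CITY_NAME", "MEDICAL_ALLERGY"]
y_pred = ["NAME_FIRST", "GEO_CITY_NAME", "NAME_LAST"]

# average="macro" takes the unweighted mean of per-class F1 scores,
# so rare categories count as much as common ones.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro_f1:.3f}")
```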
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The label encoder that maps class indices to label names ships separately;
# download label_encoder.pkl from this repo, then:
# import joblib
# label_encoder = joblib.load("label_encoder.pkl")

# Example inference
text = "John Smith"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
confidence = torch.max(probabilities).item()

print(f"Predicted class index: {predicted_class_idx}")
print(f"Confidence: {confidence:.3f}")
```
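If the exported checkpoint's config carries an `id2label` mapping (check with `print(model.config.id2label)`), the predicted index can be decoded without the pickled encoder. This is an assumption about the checkpoint, so fall back to the label encoder if the mapping is missing or generic:
```python
# Assumes model.config.id2label maps indices to the seven label names;
# if the config only holds generic LABEL_0..LABEL_6 entries, use
# label_encoder.inverse_transform([predicted_class_idx]) instead.
label = model.config.id2label.get(predicted_class_idx, "UNKNOWN")
print(f"Predicted label: {label}")
```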
## Training Details
- **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
- **Training Framework**: PyTorch with Hugging Face Transformers
- **Optimization**: Class weighting for imbalanced data
- **Max Sequence Length**: 64 tokens
- **Epochs**: 2 with early stopping
- **Batch Size**: 16
- **Learning Rate**: 3e-05
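The class weighting above can be implemented by overriding the loss in a custom `Trainer`. This is a minimal sketch of that technique, not the actual training script; the `train_labels` array and the `compute_class_weight` call are illustrative assumptions:
```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Placeholder encoded training labels; substitute the real ones.
train_labels = np.array([0, 0, 1, 2, 3, 4, 5, 6, 6, 6])

# "balanced" weights are inversely proportional to class frequency.
weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer that applies class weights to the cross-entropy loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Weighted cross-entropy upweights rare categories.
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```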
## Intended Use
This model is designed for:
- Sensitive data detection in text sequences
- PII (Personally Identifiable Information) classification
- Data governance and compliance applications
- Privacy-focused text analysis
## Limitations
- The NAME_FIRST and NAME_LAST categories fall slightly below the 90% F1 target
- The model is trained on English text only
- Performance may vary on data distributions that differ from the training set
## Model Architecture
Based on the DeBERTa-v3-xsmall architecture with:
- Enhanced relative position encoding
- Disentangled attention mechanism
- Optimized for efficiency (35% smaller than BERT-base)
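The quoted parameter count can be verified against the model loaded in the Usage section; a quick sanity check:
```python
# Sum trainable parameters of the loaded model (see Usage above);
# for deberta-v3-xsmall plus the classification head this should
# land near the 70M quoted above.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {num_params / 1e6:.1f}M")
```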
## Citation
If you use this model, please cite:
```bibtex
@misc{deberta-sequence-classification-2025,
  title={DeBERTa-v3-xsmall for Sequence Classification},
  author={Sandy Yuan},
  year={2025},
  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
}
```