# DeBERTa-v3-xsmall for Sequence Classification

A fine-tuned DeBERTa-v3-xsmall model that classifies text sequences into 7 categories of sensitive data.
## Model Description

This model achieves a **95.9% macro F1 score** when categorizing text sequences into:

- **NAME_FIRST** - First names
- **NAME_LAST** - Last names
- **GEO_STREET_NAME** - Street addresses
- **GEO_CITY_NAME** - City names
- **PROFESSION_JOB_TITLE** - Job titles
- **PROFESSION_EMPLOYER** - Company/employer names
- **MEDICAL_ALLERGY** - Medical allergies
## Performance

### F1 Scores by Category

- **GEO_STREET_NAME**: 99.8% (9.8 points above the 90% target)
- **PROFESSION_JOB_TITLE**: 99.5% (9.5 points above target)
- **MEDICAL_ALLERGY**: 99.5% (9.5 points above target)
- **PROFESSION_EMPLOYER**: 98.7% (8.7 points above target)
- **GEO_CITY_NAME**: 97.2% (7.2 points above target)
- **NAME_FIRST**: 88.8% (1.2 points below target)
- **NAME_LAST**: 87.9% (2.1 points below target)
### Overall Performance

- **Macro F1 Score**: 95.9%
- **Overall Accuracy**: 95.9%
- **Categories Meeting the 90% Target**: 5 of 7
- **Processing Speed**: 29.5 cells/second
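The processing-speed figure will vary with hardware, batch size, and sequence length. A minimal sketch for reproducing such a throughput measurement on your own setup (`classify` and the sample texts are illustrative stand-ins, not names from this model card):

```python
import time

def measure_throughput(classify, texts):
    """Return items processed per second for a classification callable.

    classify: any callable that accepts a list of strings, e.g. a batched
    Hugging Face pipeline wrapped in a function.
    """
    start = time.perf_counter()
    classify(texts)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed

# Usage (with a real model, pass e.g. lambda batch: pipe(batch)):
# rate = measure_throughput(my_classifier, ["John Smith"] * 1000)
```

Running the wrapped model over a few hundred representative cells gives a comparable cells/second number.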
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# The label encoder must be downloaded separately from the model repo:
# import joblib
# label_encoder = joblib.load("label_encoder.pkl")

# Example inference
text = "John Smith"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities.max().item()

print(f"Predicted class index: {predicted_class_idx}")
print(f"Confidence: {confidence:.3f}")
```
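The snippet above prints only a numeric class index. If the checkpoint's config stores an `id2label` mapping (many fine-tuned checkpoints do), it can resolve the index directly; otherwise a fallback list can be built from the categories on this card. Note the fallback ordering below is an assumption (alphabetical, as scikit-learn's `LabelEncoder` would produce) and should be verified against the repo's `label_encoder.pkl`:

```python
# Assumed label order (alphabetical); verify against label_encoder.pkl.
FALLBACK_LABELS = [
    "GEO_CITY_NAME", "GEO_STREET_NAME", "MEDICAL_ALLERGY",
    "NAME_FIRST", "NAME_LAST", "PROFESSION_EMPLOYER", "PROFESSION_JOB_TITLE",
]

def index_to_label(idx, id2label=None):
    """Resolve a predicted class index to a label string.

    id2label: pass model.config.id2label when available; keys may be
    ints or strings depending on how the config was serialized.
    """
    if id2label:
        return id2label.get(idx, id2label.get(str(idx), FALLBACK_LABELS[idx]))
    return FALLBACK_LABELS[idx]

print(index_to_label(3))  # "NAME_FIRST" under the alphabetical assumption
```

With the real model loaded, `index_to_label(predicted_class_idx, model.config.id2label)` returns the category name.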
## Training Details

- **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
- **Training Framework**: PyTorch with Hugging Face Transformers
- **Optimization**: Class weighting for imbalanced data
- **Sequence Length**: 64 tokens (optimized)
- **Epochs**: 2, with early stopping
- **Batch Size**: 16
- **Learning Rate**: 3e-05
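The card lists class weighting among the optimizations but does not publish the recipe. A common approach (a sketch, not the author's confirmed method) is inverse-frequency weights normalized to mean 1.0, passed to `torch.nn.CrossEntropyLoss(weight=...)` inside a custom `Trainer.compute_loss`. The class counts below are hypothetical:

```python
def inverse_frequency_weights(counts):
    """Per-class weights proportional to 1/count, rescaled to mean 1.0.

    counts: number of training examples per class. Rarer classes receive
    larger weights, so their errors contribute more to the loss.
    """
    raw = [1.0 / c for c in counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

# Hypothetical counts for the 7 classes (the real distribution is not published)
weights = inverse_frequency_weights([5000, 5000, 1200, 8000, 8000, 3000, 2500])

# In training these would feed, e.g.:
#   loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))
# inside an overridden Trainer.compute_loss.
```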
## Intended Use

This model is designed for:

- Sensitive data detection in text sequences
- PII (Personally Identifiable Information) classification
- Data governance and compliance applications
- Privacy-focused text analysis
## Limitations

- The NAME_FIRST and NAME_LAST categories perform slightly below the 90% F1 target
- The model is trained on English text only
- Performance may vary on data distributions that differ from the training set
## Model Architecture

Based on the DeBERTa-v3-xsmall architecture, with:

- Enhanced relative position encoding
- Disentangled attention mechanism
- Optimized for efficiency (35% smaller than BERT-base)
## Citation

If you use this model, please cite:

```bibtex
@misc{deberta-sequence-classification-2025,
  title={DeBERTa-v3-xsmall for Sequence Classification},
  author={Sandy Yuan},
  year={2025},
  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
}
```