# DeBERTa-v3-xsmall for Sequence Classification

A fine-tuned DeBERTa-v3-xsmall model for classifying text sequences into 7 categories related to sensitive data detection.

## Model Description

This model achieves a **95.9% macro F1 score** when categorizing text sequences into:

- **NAME_FIRST** - First names
- **NAME_LAST** - Last names
- **GEO_STREET_NAME** - Street addresses
- **GEO_CITY_NAME** - City names
- **PROFESSION_JOB_TITLE** - Job titles
- **PROFESSION_EMPLOYER** - Company/employer names
- **MEDICAL_ALLERGY** - Medical allergies

## Performance

### F1 Scores by Category

- **GEO_STREET_NAME**: 99.8% (9.8 points above the 90% target)
- **PROFESSION_JOB_TITLE**: 99.5% (9.5 points above target)
- **MEDICAL_ALLERGY**: 99.5% (9.5 points above target)
- **PROFESSION_EMPLOYER**: 98.7% (8.7 points above target)
- **GEO_CITY_NAME**: 97.2% (7.2 points above target)
- **NAME_FIRST**: 88.8% (1.2 points below target)
- **NAME_LAST**: 87.9% (2.1 points below target)

### Overall Performance

- **Macro F1 Score**: 95.9%
- **Overall Accuracy**: 95.9%
- **Categories Meeting the 90% Target**: 5/7
- **Processing Speed**: 29.5 cells/second

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# The label encoder is distributed separately; download label_encoder.pkl
# from the repository to map class indices back to label names:
# import joblib
# label_encoder = joblib.load("label_encoder.pkl")

# Example inference
text = "John Smith"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities.max().item()

print(f"Predicted class index: {predicted_class_idx}")
print(f"Confidence: {confidence:.3f}")
```

A batched variant that also decodes class indices to label names is sketched in the appendix at the end of this card.

## Training Details

- **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
- **Training Framework**: PyTorch with Hugging Face Transformers
- **Optimization**: Class weighting to compensate for label imbalance (a sketch of this setup appears in the appendix below)
- **Sequence Length**: 64 tokens, which comfortably covers the short spans in this task
- **Epochs**: 2, with early stopping
- **Batch Size**: 16
- **Learning Rate**: 3e-05

## Intended Use

This model is designed for:

- Sensitive data detection in text sequences
- PII (Personally Identifiable Information) classification
- Data governance and compliance applications
- Privacy-focused text analysis

## Limitations

- The NAME_FIRST and NAME_LAST categories fall slightly below the 90% F1 target
- The model is trained on English text only
- Performance may vary on data distributions that differ from the training set

## Model Architecture

Based on the DeBERTa-v3-xsmall architecture, which features:

- Enhanced relative position encoding
- A disentangled attention mechanism
- An efficiency-oriented design (roughly 35% smaller than BERT-base)

## Citation

If you use this model, please cite:

```bibtex
@misc{deberta-sequence-classification-2025,
  title={DeBERTa-v3-xsmall for Sequence Classification},
  author={Sandy Yuan},
  year={2025},
  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
}
```
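
## Appendix: Batch Inference with Label Decoding

The usage example above prints only the numeric class index. The sketch below batches several inputs and decodes indices via `model.config.id2label`. Whether that mapping carries the real category names (rather than generic `LABEL_0` … `LABEL_6` placeholders) depends on how the checkpoint's config was saved, which this card does not confirm; if the names are generic, decode with the separately distributed `label_encoder.pkl` instead. The `texts` values are hypothetical.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Hypothetical inputs; any iterable of short strings works.
texts = ["John", "Main Street", "software engineer"]

# Tokenize as one padded batch; max_length matches the training setup (64).
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=64)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred_ids = probs.argmax(dim=-1).tolist()
confidences = probs.max(dim=-1).values.tolist()

for text, idx, conf in zip(texts, pred_ids, confidences):
    # Falls back to the raw index if id2label has no entry for it.
    label = model.config.id2label.get(idx, str(idx))
    print(f"{text!r} -> {label} ({conf:.3f})")
```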
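
## Appendix: Training Setup Sketch

The training details state that class weighting was used to handle label imbalance, but the training script itself is not published. Below is a minimal sketch of one standard way to implement it: inverse-frequency weights from scikit-learn fed into a weighted cross-entropy loss through a `Trainer` subclass. `WeightedTrainer` and the dummy `train_labels` are illustrative names, not the released code.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Dummy stand-in for the real training labels (integer ids 0..6).
train_labels = np.random.randint(0, 7, size=1000)

# "balanced" gives each class a weight inversely proportional to its
# frequency, so rare classes contribute more to the loss.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.arange(7), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer subclass that applies class weights in the cross-entropy loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = F.cross_entropy(
            outputs.logits,
            labels,
            weight=class_weights.to(outputs.logits.device),
        )
        return (loss, outputs) if return_outputs else loss
```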
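The hyperparameters listed under Training Details map onto `TrainingArguments` roughly as follows. The argument names track current `transformers` releases (`eval_strategy` was `evaluation_strategy` in older ones), and `metric_for_best_model="macro_f1"` assumes a `compute_metrics` function that reports that key; both are assumptions rather than the released configuration.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-seqcls",  # hypothetical output path
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    eval_strategy="epoch",        # "evaluation_strategy" in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for EarlyStoppingCallback
    metric_for_best_model="macro_f1",  # assumes compute_metrics reports this key
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=1)]
```

Passing these to the `WeightedTrainer` above would reproduce the "2 epochs with early stopping" schedule: training halts early if the tracked metric fails to improve between evaluations, and the best checkpoint is restored at the end.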