sandyyuan/deberta-v3-xsmall-sequence-classification

Safetensors

deberta-v2

sandyyuan committed on Jul 25, 2025

Commit 6405ad7 (verified) · 1 parent: 846b8d8

Upload README.md with huggingface_hub

README.md +106 -0

	@@ -0,0 +1,106 @@

+# DeBERTa-v3-xsmall for Sequence Classification
+A fine-tuned DeBERTa-v3-xsmall model that classifies text sequences into 7 categories for sensitive data detection.
+## Model Description
+This model achieves a **95.9% macro F1 score** when classifying text sequences into the following categories:
+- **NAME_FIRST** - First names
+- **NAME_LAST** - Last names
+- **GEO_STREET_NAME** - Street addresses
+- **GEO_CITY_NAME** - City names
+- **PROFESSION_JOB_TITLE** - Job titles
+- **PROFESSION_EMPLOYER** - Company/employer names
+- **MEDICAL_ALLERGY** - Medical allergies
+## Performance
+### F1 Scores by Category:
+- **GEO_STREET_NAME**: 99.8% (9.8 points above the 90% target)
+- **PROFESSION_JOB_TITLE**: 99.5% (9.5 points above the 90% target)
+- **MEDICAL_ALLERGY**: 99.5% (9.5 points above the 90% target)
+- **PROFESSION_EMPLOYER**: 98.7% (8.7 points above the 90% target)
+- **GEO_CITY_NAME**: 97.2% (7.2 points above the 90% target)
+- **NAME_FIRST**: 88.8% (1.2 points below the 90% target)
+- **NAME_LAST**: 87.9% (2.1 points below the 90% target)
+### Overall Performance:
+- **Macro F1 Score**: 95.9%
+- **Overall Accuracy**: 95.9%
+- **Categories Meeting 90% Target**: 5/7
+- **Processing Speed**: 29.5 cells/second
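
The overall figures can be cross-checked against the per-category list with a few lines of arithmetic. The sketch below is not part of the committed README. It assumes that the macro F1 is the unweighted mean of the seven per-category F1 scores (the usual definition) and that the target is 90% for every category. The dictionary simply restates the numbers listed above.

```python
# Per-category F1 scores (in percent), copied from the README's list.
f1_by_category = {
    "GEO_STREET_NAME": 99.8,
    "PROFESSION_JOB_TITLE": 99.5,
    "MEDICAL_ALLERGY": 99.5,
    "PROFESSION_EMPLOYER": 98.7,
    "GEO_CITY_NAME": 97.2,
    "NAME_FIRST": 88.8,
    "NAME_LAST": 87.9,
}

# Assumption: the same 90% F1 target applies to every category.
TARGET_F1 = 90.0


def macro_f1(scores):
    """Unweighted mean of the per-category F1 scores."""
    return sum(scores.values()) / len(scores)


def categories_meeting_target(scores, target):
    """Names of the categories whose F1 is at or above the target."""
    return [name for name, score in scores.items() if score >= target]


macro = macro_f1(f1_by_category)
meeting = categories_meeting_target(f1_by_category, TARGET_F1)

print(f"Macro F1: {macro:.1f}%")
print(f"Categories meeting target: {len(meeting)}/{len(f1_by_category)}")
```

Under these assumptions the mean rounds to the reported 95.9%, and five of the seven categories reach the target, which matches the summary list.
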
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import joblib  # only needed if you load the label encoder below
+import torch
+# Load model and tokenizer
+model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# The label encoder maps class indices to category names; download it separately
+# label_encoder = joblib.load("label_encoder.pkl")
+# Example inference
+text = "John Smith"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
+with torch.no_grad():
+    outputs = model(**inputs)
+    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
+    confidence = torch.max(probabilities).item()
+print(f"Predicted class index: {predicted_class_idx}")
+print(f"Confidence: {confidence:.3f}")
+```
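
The snippet above stops at a class index, so a reader still has to map that index to a category name. The following is a minimal, dependency-free sketch of that last step and is not part of the committed README. It assumes the index order is alphabetical, which is how scikit-learn's `LabelEncoder` orders its classes. The README does not state the actual order, so the real mapping should be read from `label_encoder.pkl`. `ASSUMED_LABELS`, `decode`, and the example logits are hypothetical names and values used only for illustration.

```python
import math

# ASSUMPTION: alphabetical index order. The true index-to-label mapping is
# not given in the README and must come from the model's label encoder.
ASSUMED_LABELS = sorted([
    "NAME_FIRST",
    "NAME_LAST",
    "GEO_STREET_NAME",
    "GEO_CITY_NAME",
    "PROFESSION_JOB_TITLE",
    "PROFESSION_EMPLOYER",
    "MEDICAL_ALLERGY",
])


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels=ASSUMED_LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for illustration only; these are not real model output.
example_logits = [0.1, -1.2, 0.3, 4.0, 2.5, -0.7, 0.0]
label, confidence = decode(example_logits)
print(f"Predicted label: {label} (confidence {confidence:.3f})")
```

With the README's inference code, the logits for a single input could be passed in as `outputs.logits[0].tolist()`.
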
+## Training Details
+- **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
+- **Training Framework**: PyTorch with Hugging Face Transformers
+- **Optimization**: Class weighting for imbalanced data
+- **Sequence Length**: 64 tokens (optimized)
+- **Epochs**: 2 with early stopping
+- **Batch Size**: 16
+- **Learning Rate**: 3e-05
+## Intended Use
+This model is designed for:
+- Sensitive data detection in text sequences
+- PII (Personally Identifiable Information) classification
+- Data governance and compliance applications
+- Privacy-focused text analysis
+## Limitations
+- The NAME_FIRST and NAME_LAST categories score slightly below the 90% F1 target (88.8% and 87.9%)
+- The model was trained on English text only
+- Performance may vary on data whose distribution differs from the training set
+## Model Architecture
+Based on DeBERTa-v3-xsmall architecture with:
+- Enhanced relative position encoding
+- Disentangled attention mechanism
+- Optimized for efficiency (about 35% fewer parameters than BERT-base)
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{deberta-sequence-classification-2025,
+  title={DeBERTa-v3-xsmall for Sequence Classification},
+  author={Sandy Yuan},
+  year={2025},
+  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
+}
+```